How does the Ruby interpreter parse double-quoted strings - ruby

Background:
I am implementing a language similar to Ruby, called Sapphire, as a way to try out some ideas I have on concurrency in programming languages. I am trying to copy Ruby's double-quoted strings with embedded code, which I find very useful as a programmer.
Question:
How do any of the Ruby interpreters turn a double-quoted string with embedded code into an AST?
eg:
puts "The value of foo is #{@foo}."
puts "this is an example of unmatched braces in code: #{ foo.go('}') }"
Details:
The problem I have is how to decide which } closes the code block. Code blocks can have other braces within them and with a little effort they can be unmatched. The lexer can find the beginning of a code block in a string, but without the aid of the parser, it cannot know for sure which character is the end of that block.
It looks like Ruby's parse.y file does both the lexing and parsing steps, but reading that thing is a nightmare: it is 11,628 lines long, with no comments and lots of abbreviations.

True, Yacc files can be a bit daunting to read at first and parse.y is not the best file to start with. Have you looked at the various string production rules? Do you have any specific questions?
As for the actual parsing, it's indeed not uncommon for lexers to also parse numeric literals and strings; see e.g. the accepted answer to a similar question here on SO. If you approach things this way, it's not too hard to see how to go about it. Hitting #{ inside a string basically starts a new parsing context that gets parsed as an expression again. This means that the first } in your example can't be the terminating one for the interpolation, since it's part of a literal string within the expression. Once you reach the end of the expression (keep in mind expression separators like ;), the next } is the one you need.
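The context-switching idea can be sketched in a few lines. This is a minimal illustration, not Ruby's actual algorithm (real Ruby also handles %-literals, regexps, comments, and arbitrary nesting): track brace depth inside the embedded code, and skip over nested string literals so their braces don't count.

```ruby
def find_interpolation_end(src, start)   # start = index just past "#{"
  depth = 1
  i = start
  while i < src.length
    case src[i]
    when "{"
      depth += 1
    when "}"
      depth -= 1
      return i if depth.zero?            # this } closes the interpolation
    when "'", '"'                        # skip a nested string literal
      quote = src[i]
      i += 1
      i += (src[i] == "\\" ? 2 : 1) while i < src.length && src[i] != quote
    end
    i += 1
  end
  nil                                    # unterminated interpolation
end

s = %q(puts "this is an example of unmatched braces in code: #{ foo.go('}') }")
open_idx = s.index('#{') + 2
code = s[open_idx...find_interpolation_end(s, open_idx)]
p code   # => " foo.go('}') "
```

The `}` inside `'...'` is consumed by the string-skipping branch, so only the final `}` terminates the interpolation, which matches the behavior described above.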

This is not a complete answer, but I leave it in hopes that it might be useful either to me or one who follows me.
Matz gives a pretty detailed rundown of the yylex() function of parse.y in chapter 11 of his book. It does not directly mention strings, but it does describe how the lexer uses lex_state to resolve several locally ambiguous constructs in Ruby.
A reproduction of an English translation of this chapter can be found here.

Please bear in mind that they don't have to (create an AST at compile time).
Ruby strings can be assembled at runtime and will interpolate correctly. Therefore all the parsing and evaluation machinery has to be available at runtime. Any work done at compile time in that sense could be considered an optimisation.
So why does this matter? Because there are very effective stack-based techniques for parsing and evaluating expressions that do not create or decorate an AST. The string is read (parsed) from left to right, and as embedded tokens are encountered they are either evaluated or pushed on a stack, or cause stack contents to be popped and evaluated.
This is a simple technique to implement provided the expressions are relatively simple. If you really want the full power of the language inside every string, then you need the full compiler at runtime. Not everyone does.
Disclosure: I wrote a commercial language product that does exactly this.
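As a deliberately naive sketch of the "no compile-time AST" idea (using Ruby's eval rather than the stack machinery this answer describes; the `interpolate` name is made up, and eval-ing untrusted input is unsafe): scan the template at runtime, pull out each #{...} with simple brace counting, and evaluate it on the fly. This toy version ignores braces inside nested string literals.

```ruby
def interpolate(template, ctx)
  # One level of nested braces is tolerated by the inner \{[^{}]*\} branch.
  template.gsub(/\#\{((?:[^{}]|\{[^{}]*\})*)\}/) { eval($1, ctx) }
end

x = 7
result = interpolate('x squared is #{x * x}', binding)
puts result   # prints: x squared is 49
```

Nothing here is built ahead of time; all the work happens when the string is read, which is the point the answer is making.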

Dart also supports expressions interpolated into strings like Ruby, and I've skimmed a few parsers for it. I believe what they do is define separate tokens for a string literal preceding interpolation and a string literal at the end. So if you tokenize:
"before ${the + expression} after"
You would get tokens like:
STRING_START "before "
IDENTIFIER the
PLUS
IDENTIFIER expression
STRING " after"
Then in your parser, it's a pretty straightforward process of handling STRING_START to parse the interpolated expression(s) following it.

Our Ruby parser (see my bio) treats Ruby "strings" as complex objects having lots of substructures, including string start and end tokens, bare string literal fragments, lots of funny punctuation sequences representing the various regexp operators, and of course, recursively, most of Ruby itself for expressions nested inside such strings.
This is accomplished by allowing the lexer to detect and generate such string fragments in (for Ruby, many) special lexing modes. The parser has a (sub)grammar that defines valid sequences of tokens. That kind of parsing solves OP's original problem: the parser knows whether a curly brace matches other curly braces from the string or regexp content, and/or whether the expression has been completely assembled and the curly brace is a matching block end.
Yes, it builds an AST of the Ruby code, and of the regexps.
The purpose of all this is to allow us to build analyzers and transformers of Ruby code. See https://softwarerecs.stackexchange.com/q/11779/101

Related

Invert Format spaces and break hints

Is it possible to invert the behavior of break hints in the Format module, e.g. using standard spaces as break hints, and adding some special notation for non-breakable spaces?
The current behavior leads to situations where one might be inclined to write "Hello# world,# this# is# a# short# phrase", where every space is converted into a break hint to mimic the behavior as seen e.g. in text editors, HTML renderers, etc.
For instance, the Using the Format module documentation explicitly recommends using break hints:
Generally speaking, a printing routine using "format", should not directly output white spaces: the routine should use break hints instead.
This behavior not only complicates writing messages, but it also makes it very hard to grep strings in the source code.
It seems that following the established convention that "every space is a break hint, unless marked as non-breaking space" would be a better alternative.
Are there simple techniques to wrap such strings and invert their behavior, preferably ones that incur no excessive runtime cost and do not lead to typing issues (e.g. due to conversions between string and format)?
Since 4.02 there is a Format.pp_print_text function that will take regular text and print it, substituting spaces with cuts and \n with newlines.
Using this function you can still print using printf and other convenience functions:
printf "text '%a'" pp_print_text "hello, this is a short phrase"
In general, your question is more about library design, so it is hard to give a definitive answer here; it may be better suited for discussion on the OCaml mailing list.

Ruby Regex: parsing C++ classes

I am curious about parsing C++ code using regexp. What I have so far (using ruby) allows me to extract class declarations and their parent classes (if any):
/(struct|class)\s+([^{:\s]+)\s*[:]?([^{]+)\s*\{/
Here is an example in Rubular. Notice I can capture correctly the "declaration" and "inheritance" parts.
The point at where I am stuck is at capturing the class body. If I use the following extension of the original regex:
/(struct|class)\s+([^{:\s]+)\s*[:]?([^{]+)\s*\{[^}]*\};/
Then I can capture the class body only if it does not contain any curly braces, and therefore any class or function definition.
At this point I have tried many things but none of them make this better.
For instance, if I include in the regexp the fact that the body can contain braces, it will capture the first class declaration and then all the subsequent classes as if they were part of the first class' body!
What am I missing?
Regular expressions are not the recommended way to parse code.
Most compilers and interpreters use lexers and parsers to convert code into an abstract syntax tree before compiling or running the code.
Ruby has a few lexer gems you can try to incorporate into your project.
The group capturing might help:
# (?<match>...) names the group; \g<match> recursively reapplies it
/(struct|class)[^{]*(?<match>\{(?:\g<match>|[^{}]*)*\})/m
Here we find the matching curly bracket for the one following struct/class declaration. You probably will want to tune the regexp, I posted this to make the solution as clear as possible.
What I can offer you is this:
(struct|class)\s+([^{:\s]+)\s*[:]?([^{]+)(\{(?:[^{}]|\g<4>)*\})\s*;
Where \g<4> is a recursive application of the fourth capture group, (\{(?:[^{}]|\g<4>)*\}), which matches one brace-balanced block of any length.
Matching non-regular languages with regular expressions is never pretty. You might want to consider switching to a proper recursive descent parser, especially if you plan to do something with the stuff you just captured.
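To see the brace-balanced recursion in action, here is a sketch using a variant of the pattern above, with the recursive group matching a whole `{...}` block so bodies of any length nest correctly. Regexes remain a fragile way to parse C++: comments, string literals, and preprocessor lines will still confuse it.

```ruby
pattern = /(struct|class)\s+([^{:\s]+)\s*[:]?([^{]+)(\{(?:[^{}]|\g<4>)*\})\s*;/

src = <<~CPP
  class Foo : public Bar {
    int x;
    void f() { if (x) { x = 0; } }
  };
CPP

m = src.match(pattern)
puts m[2]   # prints: Foo
puts m[4]   # the full, brace-balanced class body
```

Group 4 (`\{(?:[^{}]|\g<4>)*\}`) calls itself for each nested `{...}`, which is what lets the match stop at the class's own closing brace instead of an inner one.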

What is the formal term for the "#{}" token in Ruby syntax?

The Background
I recently posted an answer where I variously referred to #{} as a literal, an operator, and (in one draft) a "literal constructor." The squishiness of this definition didn't really affect the quality of the answer, since the question was more about what it does and how to find language references for it, but I'm unhappy with being unable to point to a canonical definition of exactly what to call this element of Ruby syntax.
The Ruby manual mentions this syntax element in the section on expression substitution, but doesn't really define the term for the syntax itself. Almost every reference to this language element says it's used for string interpolation, but doesn't define what it is.
Wikipedia Definitions
Here are some Wikipedia definitions that imply this construct is (strictly speaking) neither a literal nor an operator.
Literal (computer programming)
Operator (programming)
The Questions
Does anyone know what the proper term is for this language element? If so, can you please point me to a formal definition?
Ruby's parser calls #{} the "embexpr" operator. That's EMBedded EXPRession, naturally.
I would definitely call it neither a literal (that's more for, e.g. string literals or number literals themselves, but not parts thereof) nor an operator; those are solely for e.g. binary or unary (infix) operators.
I would either just refer to it without a noun (i.e. for string interpolation), or perhaps call those characters the string interpolation sequence or escape.
TL;DR
Originally, I'd hypothesized:
Embedded expression seems the most likely definition for this token, based on hints in the source code.
This turned out to be true, and has been officially validated by the Ruby 2.x documentation. Based on the updates to the Ripper documentation since this answer was originally written, it seems the parser token is formally defined as string_embexpr and the symbol itself is called an "embedded expression." See the Update for Ruby 2.x section at the bottom of this answer for detailed corroboration.
The remainder of the answer is still relevant, especially for older Rubies such as Ruby 1.9.3, and the methodology used to develop the original answer remains interesting. I am therefore updating the answer, but leaving the bulk of the original post as-is for historical purposes, even though the current answer could now be shorter.
Pre-2.x Answer Based on Ruby 1.9.3 Source Code
Related Answer
This answer calls attention to the Ruby source, which makes numerous references to embexpr throughout the code base. @Phlip suggests that this identifier is an abbreviation for "EMBedded EXPRession." This seems like a reasonable interpretation, but neither the ruby-1.9.3-p194 source nor Google (as of this writing) explicitly connects the term embedded expression with embexpr in any context, Ruby-related or not.
Additional Research
A scan of the Ruby 1.9.3-p194 source code with:
ack-grep -cil --type-add=YACC=.y embexpr .rvm/src/ruby-1.9.3-p194 |
sort -rnk2 -t: |
sed 's!^.*/!!'
reveals 9 files and 33 lines with the term embexpr:
test_scanner_events.rb:12
test_parser_events.rb:7
eventids2.c:5
eventids1.c:3
eventids2table.c:2
parse.y:1
parse.c:1
ripper.y:1
ripper.c:1
Of particular interest is the inclusion of string_embexpr on line 4,176 of the parse.y and ripper.y bison files. Likewise, TestRipper::ParserEvents#test_string_embexpr contains two references to parsing #{} on lines 899 and 902 of test_parser_events.rb.
The scanner, exercised in test_scanner_events.rb, is also noteworthy. This file defines tests in #test_embexpr_beg and #test_embexpr_end that scan for the token #{expr} inside various string expressions. The tests reference both embexpr and expr, raising the likelihood that "embedded expression" is indeed a sensible name for the thing.
Update for Ruby 2.x
Since this post was originally written, the documentation for the standard library's Ripper class has been updated to formally identify the token. The usage section provides "Hello, #{world}!" as an example, and says in part:
Within our :string_literal you’ll notice two #tstring_content, this is the literal part for Hello, and !. Between the two #tstring_content statements is a :string_embexpr, where embexpr is an embedded expression.
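The :string_embexpr node is easy to see for yourself with the standard library's Ripper class:

```ruby
require 'ripper'
require 'pp'

sexp = Ripper.sexp('"Hello, #{world}!"')
pp sexp
# The tree contains a :string_embexpr node wrapping the parsed `world`
# expression, between two @tstring_content fragments.
```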
This blog post suggests it is called an 'idiom':
http://kconrails.com/2010/12/08/ruby-string-interpolation/
The Wikipedia Article doesn't seem to contradict that:
http://en.wikipedia.org/wiki/Programming_idiom
#{} is called a placeholder and is used to reference variables within a string.
puts "My name is #{my_name}"

What are the pros and cons of Ruby's general delimited input? (percent syntax)

I don't understand why some people use the percentage syntax a lot in ruby.
For instance, I'm reading through the ruby plugin guide and it uses code such as:
%w{ models controllers }.each do |dir|
path = File.join(File.dirname(__FILE__), 'app', dir)
$LOAD_PATH << path
ActiveSupport::Dependencies.load_paths << path
ActiveSupport::Dependencies.load_once_paths.delete(path)
end
Every time I see something like this, I have to go and look up the percentage syntax reference because I don't remember what %w means.
Is that syntax really preferable to ["models", "controllers"].each ...?
I think in this latter case it's more clear that I've defined an array of strings, but in the former - especially to someone learning ruby - it doesn't seem as clear, at least for me.
If someone can tell me that I'm missing some key point here then please do, as I'm having a hard time understanding why the percent syntax appears to be preferred by the vast majority of ruby programmers.
One good use for general delimited input (as %w, %r, etc. are called) is to avoid having to escape delimiters. This makes it especially good for literals with embedded delimiters. Contrast the regular expression
/^\/home\/[^\/]+\/.myprogram\/config$/
with
%r|^/home/[^/]+/.myprogram/config$|
or the string
"I thought John's dog was called \"Spot,\" not \"Fido.\""
with
%Q{I thought John's dog was called "Spot," not "Fido."}
As you read more Ruby, the meaning of general delimited input (%w, %r, &c.), as well as Ruby's other peculiarities and idioms, will become plain.
I believe it is no accident that Ruby often has several ways to do the same thing. Ruby, like Perl, appears to be a postmodern language: minimalism is not a core value, but merely one of many competing design forces.
The %w syntax shaves 3 characters off each item in the list... can't beat that!
It's easy to remember: %w{} is for "words", %r{} for regexps, %q{} for "quotes", and so on... It's pretty easy once you build such memory aids.
As the size of the array grows, the %w syntax saves more and more keystrokes by not making you type in all the quotes and commas. At least that's the reason given in Learning Ruby.
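For reference, each %-literal is plain sugar with a quoted equivalent; the following sanity checks all pass:

```ruby
raise unless %w{models controllers} == ["models", "controllers"]  # words
raise unless %i{models controllers} == [:models, :controllers]    # symbols (Ruby 2.0+)
raise unless %q{it's} == 'it\'s'                                  # single-quoted string
raise unless %Q{1 + 1 = #{1 + 1}} == "1 + 1 = 2"                  # double-quoted, interpolated
```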

What are some example use cases for symbol literals in Scala?

The use of symbol literals is not immediately clear from what I've read up on Scala. Would anyone care to share some real world uses?
Is there a particular Java idiom being covered by symbol literals? What languages have similar constructs? I'm coming from a Python background and not sure there's anything analogous in that language.
What would motivate me to use 'HelloWorld vs "HelloWorld"?
Thanks
In Java terms, symbols are interned strings. This means, for example, that reference equality comparison (eq in Scala and == in Java) gives the same result as normal equality comparison (== in Scala and equals in Java): 'abcd eq 'abcd will return true, while "abcd" eq "abcd" might not, depending on JVM's whims (well, it should for literals, but not for strings created dynamically in general).
Other languages which use symbols are Lisp (which uses 'abcd like Scala), Ruby (:abcd), Erlang and Prolog (abcd; they are called atoms instead of symbols).
I would use a symbol when I don't care about the structure of a string and use it purely as a name for something. For example, if I have a database table representing CDs, which includes a column named "price", I don't care that the second character in "price" is "r", or about concatenating column names; so a database library in Scala could reasonably use symbols for table and column names.
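Ruby's symbols (the :abcd form mentioned above) make the interning guarantee easy to check directly:

```ruby
# A symbol literal always yields the same interned object, so identity
# comparison works:
raise unless :abcd.equal?(:abcd)

# Two string values with equal contents are, in general, distinct objects.
# Unary + guarantees a mutable (hence non-deduplicated) copy of each literal.
a = +"abcd"
b = +"abcd"
raise unless a == b        # equal contents...
raise if a.equal?(b)       # ...but different objects
```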
If you have plain strings representing, say, method names in code that perhaps get passed around, you're not quite conveying things appropriately. This is the data/code boundary issue; it's not always easy to draw the line, but if we were to say that in that example those method names are more code than they are data, then we want something to clearly identify that.
A symbol literal comes into play where it clearly differentiates plain old string data from a construct being used in the code. It indicates that this isn't just some string data, but is in fact in some way part of the code. The idea is that things like your IDE would highlight it differently and, given the tooling, you could refactor on those rather than doing text search and replace.
This link discusses it fairly well.
Note: Symbols will be deprecated and then removed in Scala 3 (dotty).
Reference: http://dotty.epfl.ch/docs/reference/dropped-features/symlits.html
Because of this, I personally recommend not using Symbols anymore (at least in new scala code). As the dotty documentation states:
Symbol literals are no longer supported
it is recommended to use a plain string literal [...] instead
Python maintains an internal global table of "interned strings" holding the names of all variables, functions, modules, etc. With this table, the interpreter can make faster lookups and optimizations. You can force this process with the intern function (sys.intern in Python 3).
Java and Scala also automatically use "interned strings" for faster lookups. In Scala, you can use the intern method to force interning of a string, but this process doesn't work for all strings. Symbols benefit from being guaranteed to be interned, so a single reference-equality check is sufficient to prove equality or inequality.
