Ruby Regex: parsing C++ classes

I am curious about parsing C++ code using regexp. What I have so far (using ruby) allows me to extract class declarations and their parent classes (if any):
/(struct|class)\s+([^{:\s]+)\s*[:]?([^{]+)\s*\{/
Here is an example in Rubular. Notice I can capture correctly the "declaration" and "inheritance" parts.
The point at where I am stuck is at capturing the class body. If I use the following extension of the original regex:
/(struct|class)\s+([^{:\s]+)\s*[:]?([^{]+)\s*\{[^}]*\};/
Then I can capture the class body only if it contains no curly braces of its own, which rules out any nested class or function definitions.
At this point I have tried many things but none of them make this better.
For instance, if I extend the regexp so that the body may contain braces, it captures the first class declaration and then all the subsequent classes as if they were part of the first class's body!
What am I missing?

Regular expressions are not the recommended way to parse code.
Most compilers and interpreters use lexers and parsers to convert code into an abstract syntax tree before compiling or running the code.
Ruby has a few lexer gems that you can try to incorporate into your project.

The group capturing might help:
# "match" is a named group; \g<match> is a recursive call (backref) to it
/(struct|class)\s+(?<match>{((\g<match>|[^{}]*))*})/m
Here we find the matching curly bracket for the one following the struct/class declaration. You will probably want to tune the regexp; I posted this form to keep the solution as clear as possible.

What I can offer you is this:
(struct|class)\s+([^{:\s]+)\s*[:]?([^{]+)\{((?:[^{}]|\{\g<4>\})*)\};
Where \g<4> is a recursive application of the fourth capture group, which is ((?:[^{}]|\{\g<4>\})*), so nested {...} blocks of any depth inside the class body are consumed by the recursion.
Matching non-regular languages with regular expressions is never pretty. You might want to consider switching to a proper recursive descent parser, especially if you plan to do something with the stuff you just captured.
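For what it's worth, here is a minimal sketch of that recursive regex applied to a made-up snippet (Ruby 1.9+, whose Onigmo engine supports \g<n> subexpression calls; the Widget class is invented for illustration):
src = <<'CPP'
class Widget : public Shape {
  int w_, h_;
  void draw() { if (visible()) { render(); } }
};
CPP

re = /(struct|class)\s+([^{:\s]+)\s*[:]?([^{]+)\{((?:[^{}]|\{\g<4>\})*)\};/

if (m = re.match(src))
  puts m[2]  # class name, e.g. "Widget"
  puts m[3]  # inheritance part, e.g. " public Shape "
  puts m[4]  # class body, nested braces and all
end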

Related

Why does the Rubocop default guidelines recommend parentheses in method definitions?

Why does Rubocop / the community-driven Ruby style guide recommend parentheses in method definitions?
def my_method(param1, param2)
end
# instead of
def my_method param1, param2
end
Method calls are allowed with or without parentheses depending on the situation. However, my first impression is that a lack of parentheses in method calls is potentially much more ambiguous than a lack of parentheses in method definitions. Was there a reason behind it, e.g. to make code more fool-proof, or did it happen for "historical reasons" or "because it was the most widespread style"?
Clarification:
I am not asking for opinions about which style is easier to read.
The lint Lint/AmbiguousOperator is based on the idea that do_something *some_array is ambiguous and a source for bugs (Link). I wondered if this is the same case for Style/MethodDefParentheses (Link).
After going back to find the actual names of those Cops, my best guess right now is that there is no "technical" reason, but rather one is a proper "lint" and the other a "style" matter.
The rationale is omitted from the initial commit that this rule was part of, which suggests there was no particular technical reason for it.
The fact that the corresponding cop is placed in the Style department, rather than Lint, serves as further proof that this is a matter of just that, style.
Method definitions have a very simple syntax. The def keyword and method name are (optionally) followed by arguments, which must be followed by a terminator (newline or ;).
The possible variations are:
single line method definitions,
inline modifiers, e.g. private,
default- and keyword arguments,
splat- and block arguments.
All of these work fine both with and without parentheses. Furthermore, running a file with an unparenthesized method definition using the -w flag raises no warnings.
These factors together rule out the possibility that the parentheses are recommended to avoid ambiguity.
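To make that concrete, here is a small invented file exercising those variations without parentheses; per the point above, running it with ruby -w produces no warnings (method and argument names are made up):
# defs.rb
class Example
  def single_line a; a; end              # single-line definition

  def with_defaults a, b = 2, c: 3       # default and keyword arguments
    [a, b, c]
  end

  def with_splats *args, &block          # splat and block arguments
    block ? block.call(args) : args
  end

  private def helper x                   # inline modifier
    x.to_s
  end
end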

What is the use of the Erlang compile option "-compile({parse_transform, ms_transform})."?

As the title says, could anybody explain the use of parse_transform with ms_transform?
What is the difference between compiling with it and without it?
The -compile({parse_transform, ms_transform}). syntax invokes a parse transform.
A parse transform is a module which the compiler calls after the file or input has been parsed. The module is called with the full abstract syntax of the whole module and must return a new abstract form for the whole module. The parse transform is allowed to do whatever it wants as long as the result is legal Erlang syntax. It is like a super macro facility which works on the whole module, not just single function calls. The resulting module is then compiled. You can have many parse transforms.
Parse transforms are typically used to do compile-time evaluation and code transformations. The ets:fun2ms call mentioned by @P_A is a typical example of this as it takes a fun and at compile-time transforms this into a match specification, see Matchspecs and ets:fun2ms. But parse transforms allow you to do much more, for example add and remove functions. An example of this is a parse transform which generates access functions for all the fields in a record.
It is a very powerful tool, but unfortunately easy to get wrong and so create a real mess. There are, however, some 3rd party support tools which can be very helpful.
The ms_transform module implements a parse_transform that translates fun syntax into match specifications; ets:fun2ms, for example, relies on it.
Also you can use
-include_lib("stdlib/include/ms_transform.hrl").

How does the Ruby interpreter parse double-quoted strings

Background:
I am implementing a language similar to Ruby, called Sapphire, as a way to try out some ideas I have on concurrency in programming languages. I am trying to copy Ruby's double-quoted strings with embedded code, which I find very useful as a programmer.
Question:
How do any of the Ruby interpreters turn a double-quoted string with embedded code into an AST?
eg:
puts "The value of foo is #{#foo}."
puts "this is an example of unmatched braces in code: #{ foo.go('}') }"
Details:
The problem I have is how to decide which } closes the code block. Code blocks can have other braces within them and with a little effort they can be unmatched. The lexer can find the beginning of a code block in a string, but without the aid of the parser, it cannot know for sure which character is the end of that block.
It looks like Ruby's parse.y file does both the lexing and parsing steps, but reading that thing is a nightmare: it is 11628 lines long with no comments and lots of abbreviations.
True, Yacc files can be a bit daunting to read at first and parse.y is not the best file to start with. Have you looked at the various string production rules? Do you have any specific questions?
As for the actual parsing, it's indeed not uncommon that lexers do also parse numeric literals and strings, see e.g. the accepted answer to a similar question here on SO. If you approach things this way, it's not too hard to see how to go about it. Hitting #{ inside a string, basically starts a new parsing context that gets parsed as an expression again. This means that the first } in your example can't be the terminating one for the interpolation, since it's part of a literal string within the expression. Once you reach the end of the expression (keep in mind expression separators like ;), the next } is the one you need.
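As an aside, you can watch MRI's own lexer make this context switch via the stdlib's Ripper (a quick sketch using the first string from the question; the exact output format varies slightly between Ruby versions):
require 'ripper'
require 'pp'

# Single quotes keep the #{} literal, so Ripper sees it as source text.
pp Ripper.lex('"The value of foo is #{@foo}."')
# The token list includes :on_tstring_beg, :on_tstring_content,
# :on_embexpr_beg, :on_ivar, :on_embexpr_end and :on_tstring_end,
# i.e. the lexer emits dedicated tokens for entering and leaving the
# embedded expression.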
This is not a complete answer, but I leave it in hopes that it might be useful either to me or one who follows me.
Matz gives a pretty detailed rundown of the yylex() function of parse.y in chapter 11 of his book. It does not directly mention strings, but it does describe how the lexer uses lex_state to resolve several locally ambiguous constructs in Ruby.
A reproduction of an English translation of this chapter can be found here.
Please bear in mind that they don't have to (create an AST at compile time).
Ruby strings can be assembled at runtime and will interpolate correctly. Therefore all the parsing and evaluation machinery has to be available at runtime. Any work done at compile time in that sense could be considered an optimisation.
So why does this matter? Because there are very effective stack-based techniques for parsing and evaluating expressions that do not create or decorate an AST. The string is read (parsed) from left to right, and as embedded tokens are encountered they are either evaluated or pushed on a stack, or cause stack contents to be popped and evaluated.
This is a simple technique to implement provided the expressions are relatively simple. If you really want the full power of the language inside every string, then you need the full compiler at runtime. Not everyone does.
Disclosure: I wrote a commercial language product that does exactly this.
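For a tiny, generic illustration of the stack-based idea (a deliberately restricted sketch handling only integers, named values and +, not any particular product's implementation):
# Toy left-to-right evaluator: no AST is built, values are pushed on a
# stack and '+' is applied as soon as both operands are available.
def eval_simple(expr, vars)
  stack = []
  expr.scan(/\d+|[a-z_]+|\+/) do |tok|
    case tok
    when /\A\d+\z/ then stack.push(tok.to_i)
    when '+'       then next                 # wait for the right-hand operand
    else                stack.push(vars.fetch(tok))
    end
    stack.push(stack.pop + stack.pop) if stack.size == 2
  end
  stack.pop
end

eval_simple("1 + foo + 2", { "foo" => 40 })   # => 43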
Dart also supports expressions interpolated into strings like Ruby, and I've skimmed a few parsers for it. I believe what they do is define separate tokens for a string literal preceding interpolation and a string literal at the end. So if you tokenize:
"before ${the + expression} after"
You would get tokens like:
STRING_START "before "
IDENTIFIER the
PLUS
IDENTIFIER expression
STRING " after"
Then in your parser, it's a pretty straightforward process of handling STRING_START to parse the interpolated expression(s) following it.
Our Ruby parser (see my bio) treats Ruby "strings" as complex objects having lots of substructures, including string start and end tokens, bare string literal fragments, lots of funny punctuation sequences representing the various regexp operators, and of course, recursively, most of Ruby itself for expressions nested inside such strings.
This is accomplished by allowing the lexer to detect and generate such string fragments in special lexing modes (for Ruby, there are many of them). The parser has a (sub)grammar that defines valid sequences of tokens. And that kind of parsing solves OP's original problem; the parser knows whether a curly brace matches other curly braces from the regexp content, and/or if the regexp has been completely assembled and the curly brace is a matching block end.
Yes, it builds an AST of the Ruby code, and of the regexps.
The purpose of all this is to allow us to build analyzers and transformers of Ruby code. See https://softwarerecs.stackexchange.com/q/11779/101

What is the formal term for the "#{}" token in Ruby syntax?

The Background
I recently posted an answer where I variously referred to #{} as a literal, an operator, and (in one draft) a "literal constructor." The squishiness of this definition didn't really affect the quality of the answer, since the question was more about what it does and how to find language references for it, but I'm unhappy with being unable to point to a canonical definition of exactly what to call this element of Ruby syntax.
The Ruby manual mentions this syntax element in the section on expression substitution, but doesn't really define the term for the syntax itself. Almost every reference to this language element says it's used for string interpolation, but doesn't define what it is.
Wikipedia Definitions
Here are some Wikipedia definitions that imply this construct is (strictly speaking) neither a literal nor an operator.
Literal (computer programming)
Operator (programming)
The Questions
Does anyone know what the proper term is for this language element? If so, can you please point me to a formal definition?
Ruby's parser calls #{} the "embexpr" operator. That's EMBedded EXPRession, naturally.
I would definitely call it neither a literal (that's more for, e.g., string literals or number literals themselves, but not parts thereof) nor an operator; those are solely for, e.g., binary (infix) or unary operators.
I would either just refer to it without a noun (i.e. for string interpolation), or perhaps call those characters the string interpolation sequence or escape.
TL;DR
Originally, I'd hypothesized:
Embedded expression seems the most likely definition for this token, based on hints in the source code.
This turned out to be true, and has been officially validated by the Ruby 2.x documentation. Based on the updates to the Ripper documentation since this answer was originally written, it seems the parser token is formally defined as string_embexpr and the symbol itself is called an "embedded expression." See the Update for Ruby 2.x section at the bottom of this answer for detailed corroboration.
The remainder of the answer is still relevant, especially for older Rubies such as Ruby 1.9.3, and the methodology used to develop the original answer remains interesting. I am therefore updating the answer, but leaving the bulk of the original post as-is for historical purposes, even though the current answer could now be shorter.
Pre-2.x Answer Based on Ruby 1.9.3 Source Code
Related Answer
This answer calls attention to the Ruby source, which makes numerous references to embexpr throughout the code base. @Phlip suggests that this variable is an abbreviation for "EMBedded EXPRession." This seems like a reasonable interpretation, but neither the ruby-1.9.3-p194 source nor Google (as of this writing) explicitly references the term embedded expression in association with embexpr in any context, Ruby-related or not.
Additional Research
A scan of the Ruby 1.9.3-p194 source code with:
ack-grep -cil --type-add=YACC=.y embexpr .rvm/src/ruby-1.9.3-p194 |
sort -rnk2 -t: |
sed 's!^.*/!!'
reveals 9 files and 33 lines with the term embexpr:
test_scanner_events.rb:12
test_parser_events.rb:7
eventids2.c:5
eventids1.c:3
eventids2table.c:2
parse.y:1
parse.c:1
ripper.y:1
ripper.c:1
Of particular interest is the inclusion of string_embexpr on line 4,176 of the parse.y and ripper.y bison files. Likewise, TestRipper::ParserEvents#test_string_embexpr contains two references to parsing #{} on lines 899 and 902 of test_parser_events.rb.
The scanner, exercised in test_scanner_events.rb, is also noteworthy. This file defines tests in #test_embexpr_beg and #test_embexpr_end that scan for the token #{expr} inside various string expressions. The tests reference both embexpr and expr, raising the likelihood that "embedded expression" is indeed a sensible name for the thing.
Update for Ruby 2.x
Since this post was originally written, the documentation for the standard library's Ripper class has been updated to formally identify the token. The usage section provides "Hello, #{world}!" as an example, and says in part:
Within our :string_literal you’ll notice two #tstring_content, this is the literal part for Hello, and !. Between the two #tstring_content statements is a :string_embexpr, where embexpr is an embedded expression.
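You can reproduce that structure yourself with the stdlib's Ripper (the nesting described in the comment is abbreviated):
require 'ripper'
require 'pp'

pp Ripper.sexp('"Hello, #{world}!"')
# The output contains a :string_literal node with two :@tstring_content
# entries ("Hello, " and "!") and, between them, a :string_embexpr node
# wrapping the parsed `world` expression.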
This blog post suggests it is called an 'idiom':
http://kconrails.com/2010/12/08/ruby-string-interpolation/
The Wikipedia Article doesn't seem to contradict that:
http://en.wikipedia.org/wiki/Programming_idiom
#{} is called a placeholder and is used to reference variables within a string.
puts "My name is #{my_name}"

Why aren't the arguments to File.new symbols instead of strings?

I was wondering why the people who wrote the File library decided to make the arguments that determine what mode the file is opened in strings instead of symbols.
For example, this is how it is now:
f = File.new('file', 'rw')
But wouldn't it be a better design to do
f = File.new('file', :rw)
or even
f = File.new(:file, :rw)
for example? This seems to be the perfect place to use them since the argument definitely doesn't need to be mutable.
I am interested in knowing why it came out this way.
Update: I just got done reading a related question about symbols vs. strings, and I think the consensus was that symbols are just not as well known as strings, and everyone is used to using strings to index hash tables anyway. However, I don't think it would be valid for the designers of Ruby's standard library to plead ignorance on the subject of symbols, so I don't think that's the reason.
I'm no expert in the history of ruby, but you really have three options when you want parameters to a method: strings, symbols, and static classes.
For example, exception handling. Each exception is actually a class that descends from Exception.
ArgumentError.is_a? Class
=> true
So you could have each permission for the stream be it's own class. But that would require even more classes to be generated for the system.
The thing about symbols is they are never deleted. Every symbol you generate is preserved indefinitely; it's why using the method '.to_sym' lightly is discouraged. It leads to memory leaks.
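You can see this pinning behaviour directly (note that since Ruby 2.2 dynamically created symbols can be garbage collected, so this matters less on modern Rubies):
before = Symbol.all_symbols.size
100.times { |i| "generated_symbol_#{i}".to_sym }   # mint new symbols at runtime
after = Symbol.all_symbols.size
puts after - before   # typically 100; on pre-2.2 MRI these entries persist for the life of the process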
Strings are just easier to manipulate. If you got the input mode from the user, you would need a '.to_sym' somewhere in your code, or at the very least, a large switch statement. With a string, you can just pass the user input directly to the method (if you were so trusting, of course).
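As a small hypothetical sketch of that point (the :rw/'r+' mapping and the helper name are invented for illustration):
# What a symbol-mode API might look like, and the lookup table a wrapper
# would have to maintain back to the real string modes.
SYMBOL_MODES = { r: 'r', w: 'w', rw: 'r+', a: 'a' }.freeze

def open_with_symbol(path, mode)
  File.open(path, SYMBOL_MODES.fetch(mode))
end

# A mode read from the user arrives as a String either way, so the string API
# needs no conversion:
#   File.open(path, user_input)
# versus
#   open_with_symbol(path, user_input.to_sym)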
Also, in C, you pass a short mode string such as "r" or "w+" to fopen and friends; there is no separate symbol-like type there, just strings. Seeing as how Ruby's File API is built on that C interface, that could be where the convention comes from.
It is simply a relic from previous languages.
