Ruby monkey patching pitfalls - ruby

I'm looking for examples of why it's not a good idea to extend base classes in ruby. I need to show some people why it's a weapon to be wielded carefully.
Any horror stories you can share?

There was a pretty famous example of monkey-patching going horribly wrong about 2.5 years ago in Rubinius.
The interesting thing about this case is that both the offending code and the victim were highly visible and highly unusual. Usually, the offender is some piece of code written by some PHP script kiddy who got drunk on his 1337 metaprogramming h4X0r skillz. And the failure mode is a simple ArgumentError exception, because the original method and the monkeypatch have different arity.
However, in this case, the offender was a library in the stdlib (mathn) and the failure mode was the Rubinius VM completely blowing up.
So, what happened? Well, mathn monkeypatches the Fixnum class and changes how Fixnum arithmetic works. In particular, it changes both the results and the types of several core methods. E.g.:
r = 4/3 # => 1
r.class # => Fixnum
require 'mathn'
r = 4/3 # => (4/3)
r.class # => Rational
The problem is of course that in Rubinius, the entire Ruby compiler, the entire Ruby kernel, large parts of the Ruby core library, some parts of the Rubinius VM and other parts of the Rubinius infrastructure, are all written in Ruby. And of course, all of those use Fixnum arithmetic all over the place.
The Hash class is written in Ruby and it uses Fixnum arithmetic to compute the size of the hash buckets, compute the hash function and so on. Array is written in Ruby and needs to compute element sizes and array lengths. The FFI library is written in Ruby and needs to compute memory addresses(!) and structure sizes. Many parts of Rubinius assume that they can do some Fixnum arithmetic and then pass the result to some C function as a pointer or int.
And since Ruby doesn't support any kind of selector namespacing or class boxing or similar (although something like that is planned for Ruby 2.0), as soon as some random user code requires the mathn library, all of those pieces just spectacularly explode, because all of a sudden, the result of a Fixnum operation is no longer a Fixnum (which is basically identical to a machine int and can be passed around as such), but a Rational (which is a full-fledged Ruby object).
Basically, some code would require 'mathn' (or you would type that into IRb), and the VM would immediately die.
The solution, in this case, was the safe math plugin for the compiler: when the compiler detects that it is compiling the kernel or other core parts of Rubinius, it automatically rewrites calls to Fixnum methods into calls to private immutable copies of those methods. [Note: I think in current versions of Rubinius, the problem is solved in a different way.]
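The idea of keeping a private copy of a core method can be sketched in plain Ruby. This is not the Rubinius compiler plugin, only a rough illustration under stated assumptions: PRISTINE_DIV and with_rational_division are hypothetical names, and the temporary patch merely imitates what mathn did to integer division.

```ruby
# Capture the original implementation before anything gets a chance to patch it.
PRISTINE_DIV = Integer.instance_method(:/)

# Hypothetical helper: while the block runs, Integer#/ returns a Rational
# (imitating mathn); afterwards the original method is put back.
def with_rational_division
  Integer.send(:define_method, :/) { |other| Rational(self, other) }
  yield
ensure
  Integer.send(:define_method, :/, PRISTINE_DIV)
end

results = with_rational_division do
  {
    patched: 4 / 3,                         # goes through the monkey patch
    pristine: PRISTINE_DIV.bind(4).call(3)  # bypasses the patch entirely
  }
end

results[:patched]   # a Rational, (4/3)
results[:pristine]  # the plain integer 1
```

Code that calls the captured UnboundMethod keeps getting integer division no matter what is later done to Integer#/, which is roughly the guarantee the kernel code needed.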

The Trifecta of FAIL; or, how to patch Rails 2.0 for Ruby 1.8.7 has an example of Rails (which is a large, well-scrutinized project) causing problems because they monkeypatched String to add the method chars.

One obvious pitfall would be name collisions - if two or more packages choose the same name for a method that behaves differently.
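A minimal sketch of such a collision, assuming two hypothetical libraries that both add a String#to_slug method (the method name and both behaviors are made up for illustration):

```ruby
# "Library A" (hypothetical) joins words with hyphens.
class String
  def to_slug
    downcase.strip.gsub(/\s+/, "-")
  end
end

after_a = "Hello World".to_slug

# "Library B" (hypothetical) is loaded later and silently replaces the method.
class String
  def to_slug
    downcase.strip.gsub(/\s+/, "_")
  end
end

after_b = "Hello World".to_slug

after_a  # "hello-world"
after_b  # "hello_world"
```

Whichever library is loaded last wins, and code written against the other one changes behavior without any error being raised (Ruby only prints a "method redefined" warning when run with -w).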

Related

Can I change the encoding of a frozen String without copying it?

Can a String and its duplicate share the same underlying memory? Is there copy-on-write in Ruby?
I have a large, frozen String and I want to change its encoding. But I don't want to copy the whole String just to do that. For context, this is to pass values to a Google Protocol Buffer which has the bytes type and only accepts Encoding::ASCII_8BIT.
big_string.freeze
MyProtobuf::SomeMessage.new(
  # I would prefer not to have to copy the whole string just to
  # change the encoding.
  value: big_string.dup.force_encoding(Encoding::ASCII_8BIT)
)
It seems to work just fine for me: (using MRI/YARV 1.9, 2.x, 3.x)
require 'objspace'
big_string = Random.bytes(1_000_000).force_encoding(Encoding::UTF_8)
big_string.encoding #=> #<Encoding:UTF-8>
big_string.bytesize #=> 1000000
ObjectSpace.memsize_of(big_string) #=> 1000041
dup_string = big_string.dup.force_encoding(Encoding::ASCII_8BIT)
dup_string.encoding #=> #<Encoding:ASCII-8BIT>
dup_string.bytesize #=> 1000000
ObjectSpace.memsize_of(dup_string) #=> 40
Those 40 bytes are the size of the slot that holds a single object (RVALUE) in Ruby; the string data itself is shared rather than copied.
Note that instead of dup / force_encoding(Encoding::ASCII_8BIT) there's also b which returns a copy in binary encoding right away.
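A small sketch of String#b on a frozen string (the variable names are just for illustration):

```ruby
text = "caf\u00e9".freeze   # UTF-8 string, 5 bytes, frozen
binary = text.b             # new String object in binary encoding

binary.encoding   # ASCII-8BIT (also known as BINARY)
text.encoding     # still UTF-8; the frozen original is untouched
binary.frozen?    # false: the copy is a separate, unfrozen object
binary.bytes == text.bytes  # true: same bytes, different encoding label
```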
For more in-depth information, here's a blog post from 2012 (Ruby 1.9) about copy-on-write / shared strings in Ruby:
Seeing double: how Ruby shares string values
From the author's book Ruby Under a Microscope: (p. 265)
Internally, both JRuby and MRI use an optimization called copy-on-write for strings and other data. This trick allows two identical string values to share the same data buffer, which saves both memory and time because Ruby avoids making separate copies of the same string data unnecessarily.
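Whatever sharing happens internally, the observable semantics stay those of two independent strings. A minimal sketch of the "write" half of copy-on-write, with illustrative variable names:

```ruby
original = "a" * 1_000
copy = original.dup   # may share the buffer with original internally

copy << "b"           # writing to the copy forces it to get its own data

original.size         # 1000: the original never sees the change
copy.size             # 1001
```

How much memory each object reports at each step is an implementation detail and differs between Ruby versions, so the sketch deliberately checks only the visible behavior.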
Can a String and its duplicate share the same underlying memory? Is there copy-on-write in Ruby?
There is nothing in the Ruby Language Specification that prevents that. There is also nothing in the Ruby Language Specification that enforces that.
In general, the Ruby Language Specification tries to stay silent on all things related to memory management, space complexity, step complexity, or time complexity. This is not exclusive to the Ruby Language Specification; most Language Specifications try to leave the implementors as much leeway as possible. In other words, Language Specifications tend to specify Syntax and Semantics and leave the Pragmatics up to the implementor. (C++ is somewhat of an exception in that it specifies space and time complexity for the algorithms in the standard library.) Even C, which is typically thought of as a language that gives you full control over everything, doesn't actually specify things like memory layouts precisely – for example, due to the definition of the term width in the standard, a uint16_t is actually allowed to occupy more than 16 bits!
Every implementor is free to implement strings however they want, as long as they comply with the semantics defined in the Ruby Language Specification.
If I remember correctly, both Rubinius and TruffleRuby did, at one point, experiment with a String implementation based on Ropes. Chris Seaton, TruffleRuby's lead developer, wrote a paper about that implementation. However, I don't know if they are still using it. (I know TruffleRuby switched to Truffle Strings recently, and I am not sure what their underlying representation is … or whether they are even guaranteeing a specific underlying representation.)
There is a problem with the answer "you have to look at the specification", though: unfortunately, unlike many other programming languages, the Ruby Language Specification does not exist as a single document in a single place. Ruby does not have a single formal specification that defines what certain language constructs mean.
There are several resources, the sum of which can be considered kind of a specification for the Ruby programming language.
Some of these resources are:
The ISO/IEC 30170:2012 Information technology — Programming languages — Ruby specification – Note that the ISO Ruby Specification was written around 2009–2010 with the specific goal that all existing Ruby implementations at the time would easily be compliant. Since YARV and MacRuby only implemented Ruby 1.9+, MRI only implemented Ruby 1.8 and lower, and JRuby, XRuby, Ruby.NET, and IronRuby (at the time) only implemented a subset of Ruby 1.8, the ISO Ruby Specification only contains features that are common to both Ruby 1.8 and Ruby 1.9. Also, the ISO Ruby Specification was specifically intended to be minimal and only contain the features that are absolutely required for writing Ruby programs. Because of that, it specifies Strings, for example, only very broadly (since they changed significantly between Ruby 1.8 and Ruby 1.9). It obviously also does not specify features that were added after the ISO Ruby Specification was written, such as Ractors or Pattern Matching.
The Ruby Spec Suite aka ruby/spec – Note that the ruby/spec is unfortunately far from complete. However, I quite like it because it is written in Ruby instead of "ISO-standardese", which is much easier to read for a Rubyist, and it doubles as an executable conformance test suite.
The Ruby Programming Language by David Flanagan and Yukihiro 'matz' Matsumoto – This book was written by David Flanagan together with Ruby's creator matz to serve as a Language Reference for Ruby.
Programming Ruby by Dave Thomas, Andy Hunt, and Chad Fowler – This book was the first English book about Ruby and served as the standard introduction and description of Ruby for a long time. This book also first documented the Ruby core library and standard library, and the authors donated that documentation back to the community.
The Ruby Issue Tracking System, specifically, the Feature sub-tracker – However, please note that unfortunately, the community is really, really bad at distinguishing between Tickets about the Ruby Programming Language and Tickets about the YARV Ruby Implementation: they both get intermingled in the tracker.
The Meeting Logs of the Ruby Developer Meetings. (Same problem: Ruby and YARV get intermingled.)
New features are often discussed on the mailing lists, in particular the ruby-core (English) and ruby-dev (Japanese) mailing lists. (Same problem again.)
The Ruby documentation – Again, be aware that this documentation is generated from the source code of YARV and does not distinguish between features of Ruby and features of YARV.
In the past, there were a couple of attempts at formalizing changes to the Ruby Specification, such as the Ruby Change Request (RCR) and Ruby Enhancement Proposal (REP) processes, both of which were unsuccessful.
If all else fails, you need to check the source code of the popular Ruby implementations to see what they actually do. Please note the plural: you have to look at multiple, ideally all, implementations to figure out what the consensus is. Only looking at one implementation cannot possibly tell you whether what you are looking at is an implementation quirk of this particular implementation or is a universally agreed-upon behavior of the Ruby Language.

What is the difference between Integer and Fixnum?

I know that the Fixnum class inherits from the Integer class. But what is the actual difference between them? Are there any use cases where we sometimes use Fixnum, and sometimes use Integer instead?
UPDATE: As of Ruby 2.4, the Fixnum and Bignum classes are gone, there is only Integer. The exact same optimizations still exist, but they are treated as "proper" compiler optimizations, i.e. behind the scenes, invisible to the programmer.
This is somewhat confusing. Integer is the real class that you should think about. Fixnum is basically a performance optimization that should never have been made visible to the programmer in the first place. (Compare this with flonums in YARV, which are implemented entirely as an optimization inside the VM, and never exposed to the programmer.)
Basically, Fixnums are fast and Bignums are slow(er), and the implementation automatically switches back and forth between them. You never ask for one of those directly, you will just get one or the other, depending on whether your integer fits into the restricted size of a Fixnum or not.
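On Ruby 2.4 and later, where only Integer is visible (see the update above), the switching can no longer be seen in the class name; the only observable fact is that arithmetic keeps working across the size boundary. A small sketch, assuming Ruby 2.4+:

```ruby
small = 2**10    # fits comfortably into a machine word
large = 2**100   # far too big for a machine word

small.class      # Integer
large.class      # Integer as well, on Ruby 2.4+

# Arithmetic moves between the two internal representations transparently.
round_trip = large / 2**90
round_trip == small   # true
```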
Before Ruby 2.4, you never "use" Integer directly. It is an abstract class whose job is to endow its children (Fixnum and Bignum) with methods. Under effectively no circumstances will you ask for an object's class and be told that it is an Integer.

How does the ruby interpreter tell when you're calling a method on an integer or when you're setting a float?

I had a look through the 0.methods output in irb and couldn't see what path the ruby interpreter would take when it was passed 0.15 as opposed to 0.to_s
I've tried reading up on how ruby determines the difference between a floating point number being defined and a method being called on an integer but I haven't come to any conclusions.
The best guess I have is that because Ruby doesn't allow for a digit to lead a method name, it simply checks whether the character following the . is numeric or alphabetical.
I don't like guessing though, assumptions can lead to misunderstandings. Can someone clear this up for me?
How well can you read Yacc files? (Rhetorical question)
https://github.com/ruby/ruby/blob/trunk/parse.y#L7380 I believe this is where the Ruby parser handles floating point tokenisation.
Disclaimer: parse.y hurts my head.
Because method names in Ruby cannot begin with a digit, it's pretty easy to determine that 6.foo is a method call and 6.12 is a Float.
You can distinguish the two with a pretty simple regular grammar, which is all that a lexer needs to tokenize the source code.
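One way to see the lexer's decision without reading parse.y is the Ripper library from the standard library, which exposes the token stream. A small sketch (token_types is a hypothetical helper name):

```ruby
require 'ripper'

# Hypothetical helper: return only the token types the lexer produces.
def token_types(source)
  Ripper.lex(source).map { |token| token[1] }
end

float_tokens  = token_types("6.12")
method_tokens = token_types("6.to_s")

float_tokens   # [:on_float]
method_tokens  # [:on_int, :on_period, :on_ident]
```

A digit after the dot yields a single float token; a letter yields an integer, a period, and an identifier, which the parser then treats as a method call.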
I don't know for sure, but I think it is safe to assume that the two are distinguished by method names being unable to start with a number.
I don't see that it's an especially interesting or useful thing to know, and I think your curiosity is best directed elsewhere.

What will the major/minor differences be between ruby 1.9.2 and ruby 2.0?

I've been told that ruby 1.9.2 is ruby 2.0 but ruby 1.9.3 is slated to be released in the near future and it will contain some performance enhancements.
So what are they planning for 2.0? Will it be much different than ruby 1.9.x?
Two features that are already implemented in YARV, and which will most likely end up in Ruby 2.0, are traits (mix) and Module#prepend.
The mix method, unlike the current include method, takes a list of modules, and mixes all of them in at the same time, making sure that they have no conflicting methods. It also gives you a way to easily resolve conflicts, if e.g. two modules you want to mix in define the same method. So, basically, while the include method allows you to treat a module as a mixin, the mix method allows you to treat a module as a trait.
Module#prepend mixes a module into a class or module, again just like include does, but instead of inserting it into the inheritance chain just above the class, it inserts it just below the class. This means that methods in the module can override methods in the class, and they can delegate to the overridden methods with super, neither of which is possible when using include. This basically makes alias_method_chain obsolete.
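A minimal sketch of Module#prepend as it exists in Ruby 2.0 and later (Loud and Greeter are made-up names for illustration):

```ruby
module Loud
  # Overrides Greeter#greet and delegates to it with super.
  def greet(name)
    super.upcase + "!"
  end
end

class Greeter
  prepend Loud

  def greet(name)
    "hello, #{name}"
  end
end

greeting = Greeter.new.greet("matz")
lookup   = Greeter.ancestors.first(2)

greeting  # "HELLO, MATZ!"
lookup    # [Loud, Greeter]: the module comes before the class in method lookup
```

With include, Loud would sit after Greeter in the ancestors list, and Greeter#greet would shadow the module's version instead of the other way around.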
One feature that has been discussed for a couple of months (or 10 years, depending on how you count) is Refinements. There has been discussion for over 10 years now about adding a way to do scoped, safe monkey patching in Ruby, i.e. a way where I can monkey patch a core class, but only my code sees that monkey patch and other code doesn't. For many years, the frontrunner for that kind of safe monkey patching was Selector Namespaces. More recently, however, Classboxes have been getting a lot of attention, and even more recently, a prototype implementation and specification of Refinements, a variant of Classboxes, was put forward.
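For illustration, here is roughly what scoped monkey patching looks like with Refinements as they exist in later Ruby versions (2.1 and up). Shouting, MyCode, and shout are made-up names; the final syntax differs in details from the prototype discussed at the time.

```ruby
module Shouting
  refine String do
    def shout
      upcase + "!"
    end
  end
end

module MyCode
  using Shouting   # the refinement is active only inside this module body

  def self.announce(text)
    text.shout
  end
end

announced = MyCode.announce("hi")
leaked    = "hi".respond_to?(:shout)

announced  # "HI!"
leaked     # false: code outside MyCode does not see the patch
```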
Generally speaking, the big theme of Ruby 2.0 is scalability: scaling up to bigger teams, bigger codebases, bigger problem sizes, bigger machines, more cores. But also scaling down to smaller machines like embedded devices.
The three features I mentioned above are for scaling to bigger teams and bigger codebases. Some proposed features for scaling to bigger problem sizes and more cores are parallel collections and parallel implementations of Enumerable methods such as map, as well as better concurrency abstractions such as futures, promises, agents, actors, channels, join patterns or something like that.

In which versions of ruby are external iterator speeds improved?

According to this rubyquiz, external iterators used to be slow, but are now faster. Is this an improvement only available in YARV (the C-based implementation of ruby 1.9), or is this also available in the C-based implementation of ruby 1.8.7?
Also, does enum_for rely on external iterators?
Ruby 1.9 uses fibers to implement Enumerator#next, which might be better than Ruby 1.8, but still makes it an expensive call to make.
enum_for returns an Enumerator but does not rely on external iterators. A fiber/continuation will be created only if needed, i.e. if you call next but not if you call each or any other method inherited from Enumerable.
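The difference between the two styles, as a small sketch (variable names are illustrative):

```ruby
enum = [10, 20, 30].enum_for(:each)

# Internal iteration: plain Enumerable methods, no external iterator machinery.
doubled = enum.map { |n| n * 2 }

# External iteration: each call to next pulls exactly one element.
first  = enum.next
second = enum.next
third  = enum.next

# Once the elements are exhausted, next raises StopIteration.
finished = begin
  enum.next
  false
rescue StopIteration
  true
end
```

Only the calls to next require the enumerator to suspend and resume the underlying each, which is why they are the expensive part.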
Rubinius and JRuby are optimizing next for the builtin types because a general implementation is very difficult, in particular on the JVM. Fun bedtime reading: this thread on ruby-core
Rubinius also has some major performance enhancements, but it is a Ruby 1.8 implementation, not 1.9.
