Ruby WeakRef has implicit race condition?

I'm looking at Ruby WeakRef, and it seems that the way the API is written has an implied race condition, though it seems very unlikely to hit.
The basic usage implied by the API is:
require 'weakref'

obj = Object.new
foo = WeakRef.new(obj)
# Later on:
if foo.weakref_alive?
  puts "I can allegedly use #{foo.to_s} now"
end
# Or even:
obj2 = foo.__getobj__ if foo.weakref_alive?
The problem lies in the fact that we don't have control over when garbage collection may happen. As an example, consider another thread that regularly calls GC.start.
If garbage collection happens between the weakref_alive? check and the use of the object, we will end up hitting a RefError exception.
(I would actually expect that any large application that uses weakref - particularly those that are multithreaded - would hit RefErrors occasionally due to this)
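Here is a rough sketch of the kind of thing I mean (the loop count and sleep interval are arbitrary, and it may or may not actually trigger the race on a given run):
require 'weakref'

# Background thread that keeps forcing collections, as described above
Thread.new { loop { GC.start; sleep 0.001 } }

1_000.times do
  ref = WeakRef.new(Object.new)
  if ref.weakref_alive?
    begin
      ref.__getobj__   # a GC between the check and this call raises RefError
    rescue RefError
      puts "lost the race"
    end
  end
end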
I'm surprised there's no way to safely get the object in an atomic way if the object is available at the moment we check it.
So the first question is: am I over-concerned? Is there some reason we never have to worry about a GC happening if we fetch the object right away after checking it (as in the second example)? If not, that leads to the second question: what is the best way to safely work with weakrefs?
Right now I've added an 'obj' method to the class as one way to deal with it:
require 'weakref'

class WeakRef
  def obj
    begin
      return self.__getobj__
    rescue RefError
      return nil
    end
  end
end
But unnecessary 'rescue' statements kind of bug me. I suppose we could also:
require 'weakref'

class WeakRef
  def obj
    savegc = GC.disable  # true if GC was already disabled
    obj = self.weakref_alive? ? self.__getobj__ : nil
    GC.enable unless savegc  # only re-enable GC if it was enabled before
    return obj
  end
end
But I'm skeptical that disabling and re-enabling garbage collection is low-cost, let alone whether this sequence is actually atomic.
Any advice from ruby GC experts?

First, please note the intended use of a WeakRef object, namely to stand in for the original object. The WeakRef object implements the full duck-typed interface of the referenced object by forwarding all messages sent to it. As such, the WeakRef object is intended to be used directly in place of the original object (if that is still available).
While you may get a reference to the original object (if it is still available) with WeakRef#__getobj__, this is intended to be a special use-case and more of an implementation detail of the message delegation. If you do this, you can check whether the referenced object is still available with WeakRef#weakref_alive?. As you have noticed, there is (at least in theory) the possibility of a race condition, depending on the Ruby implementation you use.
To be sure that you handle such race conditions gracefully, you can indeed rescue the RefError if it occurs. You can just optimize the non-race-condition case a bit:
obj = Object.new
foo = WeakRef.new(obj)

begin
  obj2 = foo.__getobj__ if foo.weakref_alive?
rescue RefError
  obj2 = nil
end
You can use the same pattern for any other message sent to your weak reference (which then gets forwarded to your referenced object), e.g.
begin
  foo.to_s if foo.weakref_alive?
rescue RefError
  # do nothing, as foo is a dangling reference to a garbage-collected object
end
Depending on your use-case, this may be a bit awkward though. Also, sometimes it is necessary to have the actual object reference rather than a wrapped object (which may behave differently when asked about its specific class, e.g. in a case statement).
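A quick illustration of that difference (the String here is just a stand-in for any referenced object):
require 'weakref'

str = "hello"
ref = WeakRef.new(str)

ref.upcase   # => "HELLO" -- the message is forwarded to the referenced String

case ref
when String then "matched as String"
else "not matched"
end
# => "not matched" -- the case statement inspects the WeakRef wrapper itself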
Here, an option could be to use ObjectSpace::WeakMap instead of the WeakRef. This class is used internally by WeakRef to actually hold the weak references. Ruby actually discourages the use of this class and regards it as an internal class. However, I found it to be useful to implement a more straight-forward lookup than with just WeakRef. Just be aware that the behavior in this area might subtly change and it might be a good idea to read changelogs as you update your Ruby versions.
With that out of the way, a sample lookup with ObjectSpace::WeakMap could look like this:
# A WeakMap can store multiple mappings from an existing object to another
# (potentially garbage-collected) object. Even if you need multiple weak
# references, a single map is enough.
WEAK_MAP = ObjectSpace::WeakMap.new
# Our referenced object which may or may not be garbage-collected later
obj = Object.new
# The "marker" object is the key in map. It is used to look the reference
# to the intended object. You need to always use the same object here
# (rather than e.g. a similar string) as the actual object_id of the marker
# is used for the lookup of the referenced object
marker = Object.new
# Store a reference in the weak map
WEAK_MAP[marker] = obj
#########################################################
# Now do something else... #
# obj may be garbage-collected in the meantime. #
# You need to hold onto the marker object though! #
#########################################################
# Now, you can retrieve a reference to the actual
# original object (if it is still available) or nil
# if obj was already garbage-collected
obj2 = WEAK_MAP[marker]
As written above, the WeakRef class uses exactly this mechanism internally; the WeakRef object uses itself as the marker, which works as long as you hold on to the actual WeakRef object. The simplified lookup in WeakRef#__getobj__ thus looks like this:
class WeakRef
  WEAK_MAP = ObjectSpace::WeakMap.new

  def __getobj__
    WEAK_MAP[self] || raise(RefError, "Invalid Reference")
  end

  def weakref_alive?
    !WEAK_MAP[self].nil?
    # actually, it's this mostly equivalent code:
    # WEAK_MAP.key?(self)
  end
end
You can find the implementation of the WeakRef class at https://github.com/ruby/ruby/blob/master/lib/weakref.rb - have a look, it's actually quite readable.

Related

How to Make Or Reference a Null Ruby Binding For Eval

Rubocop dislikes the following; it issues Pass a binding, __FILE__ and __LINE__ to eval.:
sort_lambda = eval "->(a) { a.date }"
Yes, I know that eval is a security problem. The issue of security is out of scope for this question.
The Ruby documentation on binding says:
Objects of class Binding encapsulate the execution context at some particular place in the code and retain this context for future use. The variables, methods, value of self, and possibly an iterator block that can be accessed in this context are all retained. Binding objects can be created using Kernel#binding, and are made available to the callback of Kernel#set_trace_func and instances of TracePoint.
These binding objects can be passed as the second argument of the Kernel#eval method, establishing an environment for the evaluation.
The lambda being created does not need to access any variables in any scopes.
A quick and dirty binding to the scope where the eval is invoked from would look like this:
sort_lambda = eval "->(a) { a.date }", self.binding, __FILE__, __LINE__
Ideally, a null binding (a binding without anything defined in it, nothing from self, etc.) should be passed to this eval instead.
How could this be done?
Not exactly, but you can approximate it.
Before I go further, I know you've already said this, but I want to emphasize it for future readers of this question as well. What I'm describing below is NOT a sandbox. This will NOT protect you from malicious users. If you pass user input to eval, it can still do a lot of damage with the binding I show you below. Consult a cybersecurity expert before trying this in production.
Great, with that out of the way, let's move on. You can't really have an empty binding in Ruby. The Binding class is sort of compile-time magic. Although the class proper only exposes a way to get local variables, it also captures any constant names (including class names) that are in scope at the time, as well as the current receiver object self and all methods on self that can be invoked from the point of execution. The problem with an empty binding is that Ruby is a lot like Smalltalk sometimes. Everything exists in one big world of Platonic ideals called "objects", and no Ruby code can truly run in isolation.
In fact, trying to do so is really just putting up obstacles and awkward goalposts. Think you can block me from accessing BasicObject? If I have literally any object a in Ruby, then a.class.ancestors.last is BasicObject. Using this technique, we can get any global class by simply having an instance of that class or a subclass. Once we have classes, we have modules, and once we have modules we have Kernel, and at that point we have most of the Ruby built-in functionality.
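A tiny demonstration of that idea (the exact ancestor list may vary slightly between Ruby versions):
a = 42
a.class                 # => Integer
a.class.ancestors.last  # => BasicObject
a.class.ancestors       # => [Integer, Numeric, Comparable, Object, Kernel, BasicObject]
# ...and from Kernel and Object we can reach most built-in functionality again.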
Likewise, self always exists. You can't get rid of it. It's a fundamental part of the Ruby object system, and it exists even in situations where you don't think it does (see this question of mine from awhile back, for instance). Every method or block of code in Ruby has a receiver, so the most you can do is try to limit the receiver to be as small an object as possible. One might think you want self to be BasicObject, but amusingly there's not really a way to do that either, since you can only get a binding if Kernel is in scope, and BasicObject doesn't include Kernel. So at minimum, you're getting all of Kernel. You might be able to skimp by somehow and use some subclass of BasicObject that includes Kernel, thereby avoiding other Object methods, but that's likely to cause confusion down the road too.
All of this is to emphasize that a hypothetical null binding would really only make it slightly more complicated to get all of the global names, not impossible. And that's why it doesn't exist.
That being said, if your goal is to eliminate local variables, you can get that easily by creating a binding inside of a module.
module F
  module_function def get_binding
    binding
  end
end
sort_lambda = eval "->(a) { a.date }", F.get_binding
This binding will never have local variables, and the methods and constants it has access to are limited to those available in Kernel or at the global scope. That's about as close to "null" as you're going to get in the complex nexus of interconnected types and names we call Ruby.
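You can check this for yourself with a quick sanity test (not part of the original code, just an illustration):
secret = "local data"

F.get_binding.local_variables           # => []
eval "defined?(secret)", F.get_binding  # => nil
eval "secret", F.get_binding            # => NameError (undefined local variable or method)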
While I originally left this as a comment on @Silvio Mayolo's answer, which is very well written, it seems germane to post it as an answer instead.
While most of what is contained within that answer is correct, we can get slightly closer to a "null binding" through BasicObject inheritance:
class NullBinding < BasicObject
  def get_binding
    ::Kernel
      .instance_method(:binding)
      .bind(self)
      .call
  end
end
This binding context is as limited a context as you can get in Ruby.
Using this context, you will be unable to reference constants solely by name:
eval 'Class', NullBinding.new.get_binding
#=> NameError
That being said, you can still reference the top-level scope, so
eval '::Class', NullBinding.new.get_binding
#=> Class
The methods directly available in this binding context are limited to the instance methods of BasicObject. By way of example:
eval "puts 'name'", NullBinding.new.get_binding
#=> NoMethodError
Again, with the caveat that you can access the top-level scope, so:
eval "::Kernel.puts 'name'", NullBinding.new.get_binding
# name
#=> nil

Ruby: understanding data structure

Most FactoryBot factories look like this:
FactoryBot.define do
  factory :product do
    association :shop
    title { 'Green t-shirt' }
    price { 10.10 }
  end
end
It seems that inside the :product block we are building a data structure, but it's not a typical hashmap: the "keys" are not declared as symbols and no commas are used.
So my question is: what kind of data structure is this, and how does it work?
How come declaring association inside the block doesn't trigger a:
NameError: undefined local variable or method `association'
when this would happen in many other situations? Is there a subject in compsci related to this?
The block is not a data structure, it's code. association and friends are all method calls, probably being intercepted by method_missing. Here's an example using that same technique to build a regular hash:
class BlockHash < Hash
  def method_missing(key, value = nil)
    if value.nil?
      return self[key]
    else
      self[key] = value
    end
  end

  def initialize(&block)
    self.instance_eval(&block)
  end
end
With which you can do this:
h = BlockHash.new do
foo 'bar'
baz :zoo
end
h
#=> {:foo=>"bar", :baz=>:zoo}
h.foo
#=> "bar"
h.baz
#=> :zoo
I have not worked with FactoryBot, so I'm going to make some assumptions based on other libraries I've worked with. Mileage may vary.
The basics:
FactoryBot is a class (obviously).
define is a class method on FactoryBot (I'm going to assume I still haven't lost you ;) ).
define takes a block, which is pretty standard stuff in Ruby.
But here's where things get interesting.
Typically, when a block is executed it has a closure relative to where it was declared. This can be changed in most languages, but Ruby makes it super easy: instance_eval(&block) will do the trick. That means the block can have access to methods that weren't available outside the block.
factory on line 2 is just such a method. You didn't declare it, but the block it's running in isn't being executed with a standard scope. Instead, your block is immediately passed to FactoryBot, which passes it to an inner class named DSL, which instance_evals the block so that its own factory method is the one that gets run.
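Here's a stripped-down sketch of that mechanism (TinyDSL and its factory method are made-up names for illustration, not FactoryBot's real internals):
class TinyDSL
  def self.define(&block)
    new.instance_eval(&block)   # run the block with a TinyDSL instance as self
  end

  def factory(name)
    puts "defining factory: #{name}"
  end
end

TinyDSL.define do
  factory :product   # resolves to TinyDSL#factory, not a method in the calling scope
end
# prints "defining factory: product"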
Lines 3-5 of the factory example don't work that way, since you can have an arbitrary name there.
Ruby has several ways to handle missing methods, but the most straightforward is method_missing. method_missing is an overridable hook that any class can define; it tells Ruby what to do when somebody calls a method that doesn't exist.
Here it's checking whether it can treat the name as an attribute name, and it uses the parameters or block to define an attribute or declare an association. It sounds more complicated than it is. Typically in this situation I would use define_method, define_singleton_method, instance_variable_set, etc. to dynamically create and control the underlying classes.
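A rough sketch of that idea (again, made-up names rather than FactoryBot's actual implementation):
class FactoryDefinition
  attr_reader :attributes

  def initialize
    @attributes = {}
  end

  # Unknown DSL calls become attribute definitions
  def method_missing(name, *args, &block)
    @attributes[name] = block || args.first
  end

  def respond_to_missing?(_name, _include_private = false)
    true
  end
end

definition = FactoryDefinition.new
definition.instance_eval do
  title { 'Green t-shirt' }
  price { 10.10 }
end
definition.attributes.keys  # => [:title, :price]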
I hope that helps. You don't need to know this to use the library; the developers made a domain-specific language so people wouldn't have to think about this stuff. But stay curious and keep growing.

Weird Ruby class initialization logic?

Some open source code I'm integrating in my application has some classes that include code to this effect:
class SomeClass < SomeParentClass
  def self.new(options = {})
    super().tap { |o|
      # do something with `o` according to `options`
    }
  end

  def initialize(options = {})
    # initialize some data according to `options`
  end
end
As far as I understand, both self.new and initialize do the same thing - the latter one "during construction" and the former one "after construction", and it looks to me like a horrible pattern to use - why split up the object initialization into two parts where one is obviously "The Wrong Think(tm)"?
Ideally, I'd like to see what is inside the super().tap { |o| block, because although this looks like bad practice, just maybe there is some interaction required before or after initialize is called.
Without context, it is possible that you are just looking at something that works but is not considered good practice in Ruby.
However, maybe the approach of separate self.new and initialize methods allows the framework designer to implement a subclass-able part of the framework while still ensuring that the setup the framework requires gets done, without slightly awkward documentation that demands a specific use of super(). It makes for a slightly easier-to-document, cleaner-looking API if the end user gets the functionality they expect with just the subclass declaration class MyClass < FrameworkClass, and without some additional note like:
When you implement the subclass initialize, remember to put super at the start, otherwise the magic won't work
. . . personally I'd find that design questionable, but I think there would at least be a clear motivation.
There might be deeper Ruby language reasons to have code run in a custom self.new method - for instance, it may allow the constructor to switch or alter the specific object (even returning an object of a different class) before returning it; see the sketch after the examples below. However, I have very rarely seen such things done in practice; there is nearly always some other way of achieving the goals of such code without customising new.
Examples of custom/different Class.new methods raised in the comments:
Struct.new which can optionally take a class name and return objects of that dynamically created class.
Single-table inheritance in ActiveRecord, which allows the end user to load an object of unknown class from a table and receive the right object.
The latter could possibly be avoided with a different ORM design for inheritance (although all such schemes have pros and cons).
The first one (Structs) is core to the language, so it has to work like that now (although the designers could have chosen a different method name).
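To illustrate the kind of thing a custom new can do, here is a hedged sketch (Shape, Circle and Square are invented for this example, not taken from the code in the question):
class Shape
  # A custom constructor that decides which concrete class to build
  def self.new(type, *args)
    klass = (type == :circle) ? Circle : Square
    instance = klass.allocate            # use allocate to avoid recursing back into Shape.new
    instance.send(:initialize, *args)
    instance
  end
end

class Circle < Shape
  def initialize(radius)
    @radius = radius
  end
end

class Square < Shape
  def initialize(side)
    @side = side
  end
end

Shape.new(:circle, 3)  # => #<Circle ... @radius=3>
Shape.new(:square, 2)  # => #<Square ... @side=2>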
It's impossible to tell why that code is there without seeing the rest of the code.
However, there is something in your question I want to address:
As far as I understand, both self.new and initialize do the same thing - the latter one "during construction" and the former one "after construction"
They do not do the same thing.
Object construction in Ruby is performed in two steps: Class#allocate allocates a new empty object from the object space and sets its internal class pointer to self. Then, you initialize the empty object with some default values. Customarily, this initialization is performed by a method called initialize, but that is just a convention; the method can be called anything you like.
There is an additional helper method called Class#new which does nothing but perform the two steps in sequence, for the programmer's convenience:
class Class
  def new(*args, &block)
    obj = allocate
    obj.send(:initialize, *args, &block)
    obj
  end

  def allocate
    obj = __MagicVM__.__allocate_an_empty_object_from_the_object_space__
    obj.__set_internal_class_pointer__(self)
    obj
  end
end

class BasicObject
  private def initialize(*) end
end
The constructor new has to be a class method, since you start from a point where there is no instance yet; you can't call that method on a particular instance. On the other hand, an initialization routine like initialize is better defined as an instance method, because you want to do something specifically with a certain instance. Hence, Ruby is designed to internally call the instance method initialize on a new instance right after its creation by the class method new.
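A small demonstration of the two steps (Widget is just a throwaway example class):
class Widget
  def initialize
    @ready = true
  end
end

Widget.new.instance_variable_get(:@ready)       # => true (new ran initialize)
Widget.allocate.instance_variable_get(:@ready)  # => nil (allocate skips initialize)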

How can I determine what objects a call to ruby require added to the global namespace?

Suppose I have a file example.rb like so:
# example.rb
class Example
  def foo
    5
  end
end
that I load with require or require_relative. If I didn't know that example.rb defined Example, is there a list (other than ObjectSpace) that I could inspect to find any objects that had been defined? I've tried checking global_variables but that doesn't seem to work.
Thanks!
Although Ruby offers a lot of reflection methods, it doesn't really give you a top-level view that can identify what, if anything, has changed. It's only when you have a specific target that you can dig deeper.
For example:
def tree(root, seen = {})
  seen[root] = true
  root.constants.map do |name|
    root.const_get(name)
  end.reject do |object|
    seen[object] or !object.is_a?(Module)
  end.map do |object|
    seen[object] = true
    puts object
    [object.to_s, tree(object, seen)]
  end.to_h
end

p tree(Object)
Now if anything changes in that tree structure you have new things. Writing a diff method for this is possible using seen as a trigger.
The problem is that evaluating Ruby code may not necessarily create all the classes that it will or could create. Ruby allows extensive modification to any and all classes, and it's common that at run-time it will create more, or replace and remove others. Only libraries that forcibly declare all of their modules and classes up front will work with this technique, and I'd argue that's a small portion of them.
It depends on what you mean by "the global namespace". Ruby doesn't really have a "global" namespace (except for global variables). It has a sort-of "root" namespace, namely the Object class. (Although note that Object may have a superclass and mixes in Kernel, and stuff can be inherited from there.)
"Global" constants are just constants of Object. "Global functions" are just private instance methods of Object.
So, you can get reasonably close by examining global_variables, Object.constants, and Object.instance_methods before and after the call to require/require_relative.
Note, however, that, depending on your definition of "global namespace", (private) singleton methods of main might also count, so you may want to check for those as well.
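A rough sketch of that before/after comparison, using the example.rb from the question (this of course misses anything defined lazily, as noted next):
before_constants = Object.constants
before_globals   = global_variables
before_methods   = Object.instance_methods + Object.private_instance_methods

require_relative 'example'

new_constants = Object.constants - before_constants
new_globals   = global_variables - before_globals
new_methods   = (Object.instance_methods + Object.private_instance_methods) - before_methods

p new_constants  # => [:Example]
p new_globals    # => []
p new_methods    # => []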
Of course, any of the methods the script added could, when called at a later time, themselves add additional things to the global scope. For example, the following script adds nothing to the scope, but calling the method will:
class String
  module MyNonGlobalModule
    def self.my_non_global_method
      Object.const_set(:MY_GLOBAL_CONSTANT, 'Haha, gotcha!')
    end
  end
end
Strictly speaking, however, you asked about adding "objects" to the global namespace, and neither constants nor methods nor variables are objects, soooooo … the answer is always "none"?

instance_variable_set in constructor

I've made a constructor like this:
class Foo
  def initialize(p1, p2, opts = {})
    # ... initialize p1 and p2
    opts.each do |k, v|
      instance_variable_set("@#{k}", v)
    end
  end
end
I'm wondering if it's a good practice to dynamically set instance variables like this or if I should better set them manually one by one as in most of the libs, and why.
Diagnosing the problem
What you're doing here is a fairly simple example of metaprogramming, i.e. dynamically generating code based on some input. Metaprogramming often reduces the amount of code you need to write, but makes the code harder to understand.
In this particular case, it also introduces some coupling concerns: the public interface of the class is directly related to the internal state in a way that makes it hard to change one without changing the other.
Refactoring the example
Consider a slightly longer example, where we make use of one of the instance variables:
class Foo
  def initialize(opts = {})
    opts.each do |k, v|
      instance_variable_set("@#{k}", v)
    end
  end

  def greet(name)
    greeting = @greeting || "Hello"
    puts "#{greeting}, #{name}"
  end
end

Foo.new(greeting: "Hi").greet("Alice")
In this case, if someone wanted to rename the @greeting instance variable to something else, they'd possibly have a hard time understanding how to do that. It's clear that @greeting is used by the greet method, but searching the code for @greeting wouldn't help them find where it was first set. Even worse, to change this bit of internal state they'd also have to change any calls to Foo.new, because the approach we've taken ties the internal state to the public interface.
Remove the metaprogramming
Let's look at an alternative, where we just store all of the opts and treat them as state:
class Foo
  def initialize(opts = {})
    @opts = opts
  end

  def greet(name)
    greeting = @opts.fetch(:greeting, "Hello")
    puts "#{greeting}, #{name}"
  end
end

Foo.new(greeting: "Hi").greet("Alice")
By removing the metaprogramming, this clarifies the situation slightly. A new team member who's looking to change this code for the first time is going to have a slightly easier time of things, because they can use editor features (like find-and-replace) to rename the internal ivars, and the relationship between the arguments passed to the initialiser and the internal state is a bit more explicit.
Reduce the coupling
We can go even further, and decouple the internals from the interface:
class Foo
  def initialize(opts = {})
    @greeting = opts.fetch(:greeting, "Hello")
  end

  def greet(name)
    puts "#{@greeting}, #{name}"
  end
end

Foo.new(greeting: "Hi").greet("Alice")
In my opinion, this is the best implementation we've looked at:
There's no metaprogramming, which means we can find explicit references to variables being set and used, e.g. with an editor's search features, grep, git log -S, etc.
We can change the internals of the class without changing the interface, and vice-versa.
By calling opts.fetch in the initialiser, we're making it clear to future readers of our class what the opts argument should look like, without making them read the whole class.
When to use metaprogramming
Metaprogramming can sometimes be useful, but those situations are rare. As a rough guide, I'd be more likely to use metaprogramming in framework or library code which typically needs to be more generic (e.g. the ActiveModel::AttributeAssignment module in Rails), and to avoid it in application code, which is typically more specific to a particular problem or domain.
Even in library code, I'd prefer the clarity of a few lines of repetition.
Answers to this question are always going to be based on someone's personal opinion so here's mine.
Clarity v Brevity
If you cannot know the set of options ahead of time then you have no real choice but to do as you have. However if the options are drawn from a known set then I would favour clarity over brevity and have explicit methods to set the options. These would also be a good place to add any rdoc etc.
Safety
From a safety perspective, having methods to handle the setting of an option would allow you to perform validation as required.
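For example, a sketch along those lines (the option names and validation rules are just illustrative):
class Foo
  def initialize(opts = {})
    self.price = opts.fetch(:price, 0)
    self.title = opts.fetch(:title, "untitled")
  end

  private

  def price=(value)
    raise ArgumentError, "price must be a non-negative number" unless value.is_a?(Numeric) && value >= 0
    @price = value
  end

  def title=(value)
    @title = value.to_s.strip
  end
end

Foo.new(price: 10.10, title: "Green t-shirt")  # => #<Foo ...>
Foo.new(price: -1)                             # => ArgumentError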
When you need to do this kind of thing, the set of parameters tends to vary. Ruby (like most modern languages) already has handy structures for exactly that: arrays and hashes. In this case, you could just save the entire options hash as a single instance variable. That would make things simpler.
Instead of creating instance variables dynamically, you could use attr_accessor to declare the available instance variables and just call the setters dynamically:
class Foo
  attr_accessor :bar, :baz, :qux

  def initialize(opts = {})
    opts.each do |k, v|
      public_send("#{k}=", v)
    end
  end
end
Foo.new(bar: 1, baz: 2) #=> #<Foo:0x007fa8250a31e0 @bar=1, @baz=2>
Foo.new(qux: 3) #=> #<Foo:0x007facbc06ed50 @qux=3>
This approach also shows an error if an unknown option is passed:
Foo.new(quux: 4) #=> undefined method `quux=' for #<Foo:0x007fd71483aa20> (NoMethodError)
