Immutable Nokogiri documents - ruby

Once I parse a complex HTML document into a Nokogiri::HTML object, I want to pass it around to various classes to extract information from it. Recently I discovered some code that accidentally mutated the original object:
body = Nokogiri::HTML(html_text_from_the_internets)
tables = body.css('table')
tables.each do |table|
  table.css('br').each do |line_break|
    line_break.replace("\n")
  end
end
All <br> tags are now removed from all tables in body
Is there a way to freeze body so it cannot be mutated? That way, during testing, I can hopefully catch other side-effect mutations before they break things. Calling body.freeze doesn't work (probably because it only freezes the top-level object).

Yes, you can "deep freeze" a nested object.
A deep freeze is essentially a traversal that freezes all sub-objects.
You can write it yourself if you want, or use a gem such as ice_nine.
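A hand-written deep freeze might look like the sketch below. The helper name deep_freeze is hypothetical, and the example only covers plain Ruby hashes, arrays, and instance variables; it makes no claim about how objects backed by native extensions (such as Nokogiri nodes) respond to freeze.

```ruby
# Hypothetical helper: freeze an object and everything reachable from it
# through hashes, arrays, and instance variables.
def deep_freeze(obj, seen = {}.compare_by_identity)
  return obj if seen.key?(obj) # already visited; also guards against cycles
  seen[obj] = true

  case obj
  when Hash
    obj.each do |key, value|
      deep_freeze(key, seen)
      deep_freeze(value, seen)
    end
  when Array
    obj.each { |element| deep_freeze(element, seen) }
  end

  obj.instance_variables.each do |ivar|
    deep_freeze(obj.instance_variable_get(ivar), seen)
  end

  obj.freeze
end

config = { tables: [{ rows: ["a", "b"] }] }
deep_freeze(config)

error = nil
begin
  config[:tables][0][:rows] << "c" # mutation deep inside the structure
rescue FrozenError => e            # FrozenError exists since Ruby 2.5
  error = e
end
```

With only config.freeze, the nested << would have succeeded silently; after the traversal it raises, which is the behavior you want in a test run.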
Another way to test, one that doesn't involve freezing, is to compare the overall object before and after. An easy way to do this is to call Marshal.dump on the object before and after, and compare the results. If the dumps differ, you know that something has changed. (Similarly, you could make a deep clone, a.k.a. deep dup, beforehand and use that for the comparison, or you could compare an object hash.)
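The before/after comparison can be sketched as follows. The extract_names method is a hypothetical stand-in for code under test, and the data is plain Ruby; the sketch does not assume that every object (a parsed Nokogiri document, for instance) can be passed to Marshal.dump.

```ruby
# Hypothetical code under test: it is only supposed to read its argument,
# but it accidentally upcases the strings in place.
def extract_names(records)
  records.each { |record| record[:name].upcase! }
  records.map { |record| record[:name] }
end

records = [{ name: "ada" }, { name: "grace" }]

before = Marshal.dump(records) # snapshot of the full nested state
extract_names(records)
after = Marshal.dump(records)

mutated = (before != after)
puts "records was mutated" if mutated
```

In a test suite the final comparison would become an assertion that the two dumps are equal.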

Related

Look up object from `inspect` value / pointer

As part of my debugging, I need to dump out a big object. Some of the data shows up like this:
#<Sketchup::Face:0x00007f9119bafea8>
This is the result of calling .inspect on the Sketchup::Face object, and Ruby calls .inspect internally when things are printed to the console.
Is it possible to then get a reference to the actual instance, using that 0x00007f9119bafea8 identifier? I don't even know exactly what this is - a pointer, instance id, something else?
Of course I can always manipulate my data before printing to console but since this is just for temporary debugging I'm hoping there's a quick and easy way.
Note: normally I would put in a binding.pry to avoid this whole business but due to Sketchup's restrictive programming environment it's not possible to use breakpoints there.

Is this a case of circular reference? Should I avoid it?

In the following code I have two classes. When the Nation class is instantiated to an object, it also instantiates an object for the Population class with a reference to the nation object.
class Nation
  def initialize(name)
    @name = name
    @population = Population.new(self)
  end
end
class Population
  def initialize(nation)
    @nation = nation
  end
end
pry(main)> n = Nation.new("Germany")
=> #<Nation:0x0000000b3179e0 @name="Germany", @population=#<Population:0x0000000b3179b8 @nation=#<Nation:0x0000000b3179e0 ...>>>
Is this the case of circular reference?
Is it something that should be avoided?
Why is the Ruby interpreter not giving any errors? Isn't this leading to a kind of infinite recursion? When I create object n, it comes with a reference to object p, which comes with a reference to object n, which comes with a reference to object p... so how is the interpreter not going into some kind of infinite loop, as with a runaway recursive function, which eventually terminates with a "stack level too deep" error?
How could I refactor code like this, where objects need to know about each other?
It's an old question, but I didn't see a good answer, so taking a crack at it.
Having two objects reference each other is not a problem, just like having two people point fingers at each other is not a problem. However, if somebody tries to follow these pointers, never realizing that they are going back and forth, then it's a problem.
When you ran this code:
pry(main)> n = Nation.new("Germany")
You created two objects that point at each other, and there is no problem with that. However, because you wrote the above line in a pry session, Ruby tried to output the resulting object for you to see…
=> #<Nation:0x0000000b3179e0 @name="Germany", @population=#<Population:0x0000000b3179b8 @nation=#<Nation:0x0000000b3179e0 ...>>>
… and this is a problem. When Ruby renders an object like that, it traverses all the instance variables in the object recursively and prints them out. Since your variables have a circular reference, this traversal could make Ruby go back and forth forever. That is, if Ruby never realizes that it's going back and forth.
So why did it stop?
Ruby's inspect can realize it's going back and forth. When ruby recursively traverses objects, it keeps track of objects it's already seen. As soon as it encounters the same object twice, it stops and outputs the ... to prevent any further looping.
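This is easy to see outside of pry. The sketch below rebuilds the two classes from the question and also uses an array that contains itself; in both cases inspect prints ... at the point where it meets an object it is already in the middle of printing.

```ruby
class Nation
  def initialize(name)
    @name = name
    @population = Population.new(self)
  end
end

class Population
  def initialize(nation)
    @nation = nation
  end
end

n = Nation.new("Germany")
nation_text = n.inspect
puts nation_text # the inner Nation is abbreviated with "..."

loop_array = []
loop_array << loop_array # an array that contains itself
array_text = loop_array.inspect
puts array_text # prints [[...]]
```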
Are circular references to be avoided?
It depends on what you're doing, and with which libraries. The most common reason to traverse objects recursively (besides inspecting them) is serialization into JSON, YAML, etc.
If you are going to serialize objects, it's best to avoid circular references. There are some libraries out there that have clever techniques to serialize circular references, but if you can help it, avoid additional complexity. Serialization is complex enough as it is.
Bottom line: circular references are good for runtime convenience, and bad for recursively traversing or serializing objects. Use it like a sharp knife, with extra care.
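As a small illustration of how serializers differ, the sketch below feeds the same self-referencing array to two libraries from Ruby's standard distribution. Marshal is one serializer that does preserve cycles; the json library, with its default settings, gives up once its nesting limit is exceeded. The variable names are only for illustration.

```ruby
require "json"

loop_array = []
loop_array << loop_array # circular: the array contains itself

# Marshal records object identity, so the cycle survives a round trip.
copy = Marshal.load(Marshal.dump(loop_array))
marshal_keeps_cycle = copy[0].equal?(copy)

# JSON has no way to express the cycle and stops at its nesting limit.
json_error =
  begin
    JSON.generate(loop_array)
    nil
  rescue JSON::NestingError => e
    e
  end

puts "Marshal kept the cycle: #{marshal_keeps_cycle}"
puts "JSON failed with: #{json_error.class}" if json_error
```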
Isn't this leading to a kind of infinite recursion?
Nope, there's no infinite loop/recursion. You create a nation, it creates a population and that's it. However, if, when created, population were to create a nation too, that would lead to an infinite recursion. But in this form, the code is fine.
Is this the case of circular reference?
In most languages, yes, but not in Ruby. Why? Ruby is an interpreted language. It checks whether a class exists only when the class is actually needed, for example when one of the two initialize methods runs. Before initialize runs, Ruby only checks that there are no syntax errors.
Is it something that should be avoided?
No IMHO.

Why does delete return the deleted element instead of the new array?

In Ruby, Array#delete(obj) will search for and remove the specified object from the array. However, maybe I'm missing something here, but I found the return value (the obj itself) quite strange and even a little bit useless.
My humble opinion is that, to be consistent with methods like sort/sort! and map/map!, there should be two methods, e.g. delete/delete!, where
ary.delete(obj) -> new array, with obj removed
ary.delete!(obj) -> ary (after removing obj from ary)
I say this for several reasons. First, the current delete is non-pure, and it should warn the programmer about that just like many other methods in Array (in fact, the entire delete_??? family has this issue; they are quite dangerous methods!). Second, returning the obj is much less chainable than returning the new array. For example, if delete worked the way I described above, I could do multiple deletions in one statement, or do something else after deletion:
ary = [1,2,2,2,3,3,3,4]
ary.delete(2).delete(3) #=> [1,4], equivalent to "ary - [2,3]"
ary.delete(2).map{|x|x**2} #=> [1,9,9,9,16]
which is elegant and easy to read.
So I guess my question is: is this a deliberate design choice made for some reason, or is it just a legacy of the language?
If you already know that delete is always dangerous, there is no need to add a bang ! to further signal that it is dangerous. That is why it does not have one. Other methods like map may or may not be dangerous; that is why they have versions with and without the bang.
As for why it returns the extracted element, it provides access to information that is cumbersome to refer to if it were not designed like that. The original array after modification can easily be referred to by accessing the receiver, but the extracted element is not easily accessible.
Perhaps, you might be comparing this to methods that add elements, like push or unshift. These methods add elements irrespective of what elements the receiver array has, so returning the added element would be always the same as the argument passed, and you know it, so it is not helpful to return the added elements. Therefore, the modified array is returned, which is more helpful. For delete, whether the element is extracted depends on whether the receiver array has it, and you don't know that, so it is useful to have it as a return value.
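A short sketch of the return values being contrasted here, using only core Array methods (the variable names are just for illustration):

```ruby
ary = [1, 2, 2, 2, 3]

removed = ary.delete(2)                 # removes every 2 and returns 2
missing = ary.delete(9)                 # 9 is not present, so this returns nil
fallback = ary.delete(9) { :not_found } # the block supplies the "not found" value

# push, by contrast, returns the receiver itself, since the added
# element is already known to the caller.
pushed = ary.push(4)

puts ary.inspect # the receiver still gives access to the modified array
```

The return value of delete therefore doubles as a "was it there?" check, while the modified array stays reachable through ary.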
For anyone who might be asking the same question, I think I understand it a little bit more now so I might as well share my approach to this question.
So the short answer is that Ruby is not a language originally designed for functional programming, nor does it make purity of methods a priority.
On the other hand, for the particular applications described in my question, we do have alternatives. The - method can be used as a pure alternative to delete in most situations. For example, the code in my question can be implemented like this:
ary = [1,2,2,2,3,3,3,4]
ary.-([2]).-([3]) #=> [1,4], or simply ary.-([2,3])
ary.-([2]).map{|x|x**2} #=> [1,9,9,9,16]
and you can happily get all the benefits from the purity of -. For delete_if, I guess in most situations select (with return value negated) could be a not-so-great pure candidate.
As for why the delete family was designed like this, I think it's more a difference in point of view. These methods are meant to be shorthands for commonly needed non-pure procedures rather than counterparts to the functional-flavored select, map, etc.
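For comparison, here is a sketch of the non-mutating alternatives side by side with delete_if. Array#reject is used as the "select with the condition negated" mentioned above; that pairing is my reading of the suggestion, not something the answer spells out.

```ruby
ary = [1, 2, 2, 2, 3, 3, 3, 4]

without_2_and_3 = ary - [2, 3]         # new array; ary is untouched
squares = (ary - [2]).map { |x| x**2 } # chains naturally
evens_only = ary.reject(&:odd?)        # pure counterpart of delete_if

untouched = ary.dup                    # snapshot to show nothing changed so far
ary.delete_if(&:odd?)                  # the mutating version changes ary itself
```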
I’ve wondered some of these same things myself. What I’ve largely concluded is that the method simply has a misleading name that carries with it false expectations. Those false expectations are what trigger our curiosity as to why the method works like it does. Bottom line—I think it’s a super useful method that we wouldn’t be questioning if it had a name like “swipe_at” or “steal_at”.
Anyway, another alternative we have is values_at(*args), which is functionally the opposite of delete_at: you specify what you want to keep and get back a new array of those elements (as opposed to specifying what you want to remove and getting back the removed item).

Why isn't there a deep copy method in Ruby?

I am working on a solution for technical drawings (svg/ruby). I want to manipulate rectangles, and have an add! method in this class:
class Rect
  def add!(delta)
    @x1 += delta
    ... # and so on
    self
  end
end
I also need an add method returning a Rect, but not manipulating self:
def add(delta)
  r = self.dup/clone/"copy" # <-- not really all three, and no quotes; just shorthand in this text
  r.add! delta
end
dup and clone don't do my thing but:
def copy; Marshal.load(Marshal.dump(self)); end
does.
Why does such a basic functionality not exist in plain Ruby? Please just don't tell me that I could reverse add and add!, letting add do the job, and add! calling it.
I'm not sure why there's no deep copy method in Ruby, but I'll try to make an educated guess based on the information I could find (see links and quotes below the line).
Judging from this information, I could only infer that the reason Ruby does not have a deep copy method is because it's very rarely necessary and, in the few cases where it truly is necessary, there are other, relatively simple ways to accomplish the same task:
As you already know, using Marshal.dump and Marshal.load is currently the recommended way to do this. This is also the approach recommended by The Ruby Programming Language (see excerpts below).
Alternatively, there are at least 3 available implementations found in these gems: deep_cloneable, deep_clone and ruby_deep_clone; the first being the most popular.
Related Information
Here's a discussion over at comp.lang.ruby which might shed some light on this. There's another answer here with some associated discussions, but it all comes back to using Marshal.
There weren't any mentions of deep copying in Programming Ruby, but there were a few mentions in The Ruby Programming Language. Here are a few related excerpts:
[…]
Another use for Marshal.dump and Marshal.load is to create deep copies
of objects:
def deepcopy(o)
Marshal.load(Marshal.dump(o))
end
[…]
… the binary format used by Marshal.dump and Marshal.load is
version-dependent, and newer versions of Ruby are not guaranteed to be
able to read marshalled objects written by older versions of Ruby.
[…]
Note that files and I/O streams, as well as Method and Binding
objects, are too dynamic to be marshalled; there would be no reliable
way to restore their state.
[…]
Instead of making a defensive deep copy of the array, just call
to_enum on it, and pass the resulting enumerator instead of the array
itself. In effect, you’re creating an enumerable but immutable proxy
object for your array.
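To make the difference between dup and the Marshal round trip concrete, here is a minimal sketch. This Rect is a hypothetical variant of the one in the question: it stores its corners in nested arrays, which is an assumption chosen so that the shallow copy visibly shares state.

```ruby
# Hypothetical Rect whose state lives in nested arrays.
class Rect
  attr_reader :corners

  def initialize(x1, y1, x2, y2)
    @corners = [[x1, y1], [x2, y2]]
  end

  # Mutating version: shifts every coordinate in place.
  def add!(delta)
    @corners.each { |point| point.map! { |value| value + delta } }
    self
  end

  # Non-mutating version: deep copy first, then mutate the copy.
  def add(delta)
    Marshal.load(Marshal.dump(self)).add!(delta)
  end
end

original = Rect.new(0, 0, 2, 2)

shallow = original.dup
shallow.add!(1)                  # dup shares @corners, so original moves too
after_shallow = Marshal.load(Marshal.dump(original.corners))

moved = original.add(10)         # deep copy: original stays where it is
```

After the dup-based call the original has moved along with the copy; after the Marshal-based add it has not.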
Forget marshalling. The deep_dive gem will solve your problems.
https://rubygems.org/gems/deep_dive
Why can't you use something like this:
new_item = Item.new(old_item.attributes)
new_item.save!
This would copy all the attributes from existing item to new one, without issues. If you have other objects, you can just copy them individually.
I think it's the quickest way to copy an object.

Adding a "source" attribute to ruby objects using Rubinius

I'm attempting to (for fun and profit) add the ability to inspect objects in ruby and discover their source code. Not the generated bytecode, and not some decompiled version of the internal representation, but the actual source that was parsed to create that object.
I was up quite late learning about Rubinius, and while I don't have my head around it yet fully, I think I've made some progress.
I'm having trouble figuring out how to do this, though. My first approach was to simply add another instance attribute to the AST structures (for, say, a ClosedScope object). Then, somehow pull that attribute out again when the bytecode is interpreted at runtime.
Does this seem like a sound approach?
As Mr Samuel says, you can just use pry and do show-source foo. But perhaps you'd like to know how it works under the hood.
Ruby provides two things that are useful. First, you can get a list of all methods on an object: just call foo.methods. Second, each method records the file name and line number where it was defined (exposed in Ruby as Method#source_location).
To find the entire source code for an object, we scan through all the methods and group them by where they are defined. We then scan up the file back until we see class or module or a few other ways rubyists use to define methods. We then scan forward in each file until we have identified the entire class/module definition.
As dgitized points out we often end up with multiple such definitions, if people have monkey patched core objects. By default pry only shows the module definition which contains most methods; but you can request the others with show-source -a.
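The first step of that process can be sketched with plain Ruby. This is not pry's actual implementation, only an illustration of the building blocks; Greeter is a hypothetical class, and the grouping covers just the methods defined directly on it.

```ruby
class Greeter
  def hello
    "hi"
  end

  def bye
    "bye"
  end
end

# Each method defined in Ruby code knows where it came from.
file, line = Greeter.instance_method(:hello).source_location

# Group the class's own methods by the file that defines them.
by_file = Greeter.instance_methods(false).group_by do |name|
  Greeter.instance_method(name).source_location.first
end

# Methods implemented in C typically report no location at all.
core_location = "x".method(:upcase).source_location

puts "hello is defined at #{file}:#{line}"
```

From the file and line, a tool can then read the source file and scan backward and forward for the enclosing class or module definition, as described above.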
Have you looked into Pry? It is a Ruby interpreter/debugger that claims to be able to do just what you've asked.
Have you tried set_trace_func? It's not Rubinius-specific, but it does what you ask and isn't based on pry or some other gem.
See http://www.ruby-doc.org/core-1.9.3/Kernel.html#method-i-set_trace_func
