Object address in Ruby

Short version: The default inspect method for a class displays the object's address.* How can I do this in a custom inspect method of my own?
*(To be clear, I want the 8-digit hex number you would normally get from inspect. I don't care about the actual memory address. I'm just calling it a memory address because it looks like one. I know Ruby is memory-safe.)
Long version: I have two classes, Thing and ThingList. ThingList is a subclass of Array specifically designed to hold Things. Due to the nature of Things and the way they are used in my program, each Thing has an instance variable @container that points back to the ThingList that holds it.
It is possible for two Things to have exactly the same data. Therefore, when I'm debugging the application, the only way I can reliably differentiate between two Things is to use inspect, which displays their address. When I inspect a Thing, however, I get pages upon pages of output, because inspect will recursively inspect @container, causing every Thing in the list to be inspected as well!
All I need is the first part of that output. How can I write a custom inspect method on Thing that will just display this?
#<Thing:0xb7727704>
EDIT: I just realized that the default to_s does exactly this. I didn't notice this earlier because I have a custom to_s that provides human-readable details about the object.
Assume that I cannot use to_s, and that I must write a custom inspect.

You can get the address by taking object_id and multiplying it by 2*, then display it in hex using sprintf (aka %):
"#<Thing:0x%08x>" % (object_id * 2)
Of course, as long as you only need the number to be unique and don't care that it's the actual address, you can just leave out the * 2.
* For reasons that you don't need to understand (meaning: I don't understand them), object_id returns half the object's memory address, so you need to multiply by 2 to get the actual address.
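Putting that together, a minimal sketch of a custom inspect built on that format string (everything beyond the header is illustrative; self.class is used so subclasses print their own name):
class Thing
  def inspect
    # Rebuild just the default-style header, without recursing into @container.
    "#<%s:0x%08x>" % [self.class, object_id * 2]
  end
end

# thing.inspect  # => e.g. "#<Thing:0xb7727704>"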

This is impossible. There is no way in Ruby to get the memory address of an object, since Ruby is a memory-safe language which has (by design) no methods for accessing memory directly. In fact, in many implementations of Ruby, objects don't even have a memory address. And in most of the implementations that do map objects directly to memory, the memory address potentially changes after every garbage collection.
The reason using the memory address as an identifier accidentally works in current versions of MRI and YARV is that they have a crappy garbage collector implementation that never defragments memory. All other implementations have garbage collectors which do defragment memory, and thus move objects around in memory, thereby changing their address.
If you tie your implementation to the memory address, your code will only ever work on slow implementations with crappy garbage collectors. And it isn't even guaranteed that MRI and YARV will always have crappy garbage collectors; in fact, in both implementations the garbage collector has been identified as one of the major performance bottlenecks, and it is safe to assume that the garbage collectors will change. There are already some major changes to YARV's garbage collector in the SVN, which will be part of YARV 1.9.3 and YARV 2.0.
If you want an ID for objects, use Object#object_id.
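For example (the Thing definition here is purely illustrative), two value-equal objects still have distinct IDs:
class Thing
  attr_reader :data
  def initialize(data)
    @data = data
  end
  def ==(other)
    other.is_a?(Thing) && other.data == data   # value equality: two Things can compare equal
  end
end

a = Thing.new("same data")
b = Thing.new("same data")

a == b                       # => true  (same data)
a.equal?(b)                  # => false (distinct objects)
a.object_id == b.object_id   # => false (each live object has a unique ID)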

Instead of subclassing Array, your class instances could delegate to one for the desired methods, so that you don't inherit the overridden inspect method.
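One way to set that up, as a sketch (ThingList, @items and the forwarded methods are illustrative), is Forwardable from the standard library, so the list behaves like an Array for selected methods without inheriting Array#inspect:
require 'forwardable'

class ThingList
  include Enumerable
  extend Forwardable

  # Forward only the Array methods we actually need to the wrapped array.
  def_delegators :@items, :<<, :push, :each, :size, :[], :delete

  def initialize
    @items = []
  end
end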

Related

How to track/find out which userdata are GC-ed at certain time?

I've written an app in LuaJIT, using a third-party GUI framework (FFI-based) plus some additional custom FFI calls. The app suddenly loses part of its functionality at some point soon after being run, and I'm quite confident it's because of some unpinned objects being GC-ed. I assume they're only referenced from the C world¹, so the Lua GC thinks they're unreferenced and can free them. The problem is, I don't know which of the numerous userdata are unreferenced (unpinned) on the Lua side.
To confirm my theory, I've run the app with GC disabled, via:
collectgarbage 'stop'
and lo, with this line, the app works perfectly well long past the point where it got broken before. Obviously, it's an ugly workaround, and I'd much prefer to have the GC enabled, and the app still working correctly...
I want to find out which unpinned object (userdata, I assume) gets GC-ed, so I can pin it properly on the Lua side to prevent it from being GC-ed prematurely. Thus, my question is:
(How) can I track which userdata objects got collected when my app loses functionality?
One problem is that, AFAIK, the LuaJIT FFI already assigns custom __gc handlers, so I cannot add my own, as there can be only one per object. And anyway, the framework is too big for me to try adding __gc in each and every imaginable place in it. Also, I've already eliminated the "most obviously suspected" places in the code by removing local from some variables, thus making them part of _G, so I assume they are not GC-able. (Or is that not enough?)
¹ Specifically, WinAPI.
For now, I've added some ffi.gc() handlers to some of my objects (printing some easily visible ALL-CAPS messages), then added some eager collectgarbage() calls to try triggering the issue as soon as possible:
ffi.gc(foo, function()
  print '\n\nGC FOO !!!\n\n'
end)
[...]
collectgarbage()
And indeed, this exposed some GCing I didn't expect. Specifically, it led me to discover a note in LuaJIT's FFI docs, which is most certainly relevant in my case:
Please note that [C] pointers [...] are not followed by the garbage collector. So e.g. if you assign a cdata array to a pointer, you must keep the cdata object holding the array alive [in Lua] as long as the pointer is still in use.

Why does "garbage" data appear to not be meaningful?

I have always wondered why garbage data appears to not be meaningful. For clarity, what I mean by "garbage" is data that is just whatever happens to be at a particular memory address, that you have access to because of something like forgetting to initialize a variable.
For example, printing out an unused array gave me this:
#°õN)0ÿÿl¯ÿ¯ÿ ``¯ÿ¯ÿ #`¯ÿø+))0 wy¿[d
Obviously, this is useless for my application, but it also seems like it is not anything useful for any application. Why is this? Is there some sort of data protection going on here perhaps?
As you state in your question:
... "garbage" is data that is just whatever happens to be at a particular memory address, that you have access to because of something like forgetting to initialize a variable.
This implies that something else used to be in that memory before you got to use it for your variable. Whatever used to be there may or may not have any relation to how you wish to use the variable. That is, most languages do not force memory used for one type of object to be reused for the exact same type.
This means, if memory was used to store a pointer, and then released, that same memory may be used to store a string. If the pointer value was read out as if it was a string, something that looks like garbage may appear. This is because the bytes used to represent a pointer value are not restricted to the values that correspond to printable ASCII values.
A common way to detect that a buffer overrun has occurred in a program is to examine a pointer value and see whether it contains printable ASCII values. In that situation the code using the memory as a pointer sees junk, but junk that happens to be "printable".
Of course memory is never garbage, unless you make a conscious effort. After all, you are on a deterministic machine, even if it doesn't always seem like it. (Of course, if you interpret arbitrary bytes as text then it's unlikely that you'll see yourself as ASCII art, although you would deserve it.)
That was the cause of one of the worst bugs in history, quite recently; cf. https://xkcd.com/1354/. Where do you live to have missed it?

Concept of 'serializing' complete memory of object

I would like to ask a very general question about a technical concept of which I do not know whether it exists or whether it is feasible at all.
The idea is the following:
I have an object in a garbage-collected language (e.g. C# or Java). The object may itself contain several objects, but there are no references to any objects that are not sub-elements of the object (or the object itself).
Theoretically it would be possible to get the memory used by this object, which is most likely not one connected piece. Because I have some knowledge about the object, I can find all reference variables/properties and pointers that in the end point to other pieces of memory (probably indirectly, depending on the implementation of the programming language and virtual machine). I can take these pieces of memory and combine them into one bigger piece (correcting the references/pointers so that they stay intact). This piece of memory, basically bytes, could be written to storage, for example a database or a Redis cache. On another machine I could theoretically load this object again and put it into the memory of the virtual machine (maybe again correcting the references/pointers if they are absolute rather than relative). Then I should have the same object on the other VM. The object can be as complicated as I want, may also contain events or whatever, and I would be able to get the state of the object transferred to another VM (running on another computer). The only condition is that it must not contain references to anything outside the object. And of course I have to know the class type of the object on the other VM.
I ask this question because I want to share the state of an object, and I think all this serialization work is just overhead; it would be very simple if I could just freeze the memory and transport it to another VM.
Is something like this possible? I'd say yes, though it might be complicated, and maybe it is not possible with some VMs due to their architecture. Does something like this exist in any programming language? Maybe even in non-garbage-collected languages?
NOTE: I am not sure what tags should be added to this question apart from programming-language, and I am not sure whether there might be a better place for such a question. So please forgive me.
EDIT:
Maybe the concept can be compared to the initrd on Linux or hibernation in general.
You will have to collect all references to other objects, including graphs of objects (cycles), without duplication. It would require some kind of "stop the world", at least for the serializing thread. It's complicated to do efficiently, but possible: the native serialization mechanisms in many languages (e.g. Java) do it for the developer.
You will need some kind of VM to abstract away the byte order of different hardware architectures.
You will have to detach the object from any kind of environment. You can't pass objects representing threads, file handles, sockets, etc. How will you detect them?
In today's systems memory is virtual, so it will be impossible to simply copy addresses from one machine to another; you will have to translate them.
Objects are not only the data visible to the developer; there is also structure, sandboxing information, permissions, superclasses, which methods/types have already been loaded and which have not yet been loaded because of optimizations and lazy loading, garbage-collector metadata, etc.
Versioning of your object/class: on one machine class A may be created from source version 1, while on another machine there may already be objects of class A built from source version 2.
Take performance into consideration: will it be faster than old-school serialization? What benefits will it have?
And probably many more things none of us has thought about.
So: I've never heard of such a solution. It seems theoretically doable, but for some reason no one has ever done it; everyone offers plain old programmatic serialization. Maybe you will discover a new, better way, but keep in mind you'll be going against the crowd.
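For comparison, here is a minimal sketch of that "plain old programmatic serialization" route in Ruby (the Thing class and its fields are made up for illustration; Marshal is Ruby's built-in binary serializer, handles cyclic object graphs, and refuses objects tied to the environment such as IO or Thread):
# A plain data object with no references to the outside environment.
class Thing
  attr_accessor :name, :children
  def initialize(name, children = [])
    @name = name
    @children = children
  end
end

root = Thing.new("root", [Thing.new("child")])

bytes = Marshal.dump(root)   # serialize the whole object graph to a byte string

# The bytes can be written to a database, Redis, a file, etc.
# Another VM that knows the Thing class can rebuild the object:
copy = Marshal.load(bytes)
copy.children.first.name     # => "child"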

How to destroy Ruby object?

Suppose there is a simple object like:
object = Object.new
As far as I know, this creates an Object in memory (RAM).
Is there a way to delete this object from RAM?
Other than hacking the underlying C code, no. Garbage collection is managed by the runtime so you don't have to worry about it. Here is a decent reference on the algorithm in Ruby 2.0.
Once you have no more references to the object in memory, the garbage collector will go to work. You should be fine.
The simple answer is, let the GC (garbage collector) do its job.
When you are ready to get rid of that reference, just do object = nil, and don't keep any other references to the object.
The garbage collector will eventually collect that and clear the reference.
(from ruby site)
=== Implementation from GC
------------------------------------------------------------------------------
GC.start -> nil
GC.start(full_mark: true, immediate_sweep: true) -> nil
------------------------------------------------------------------------------
Initiates garbage collection, unless manually disabled.
This method is defined with keyword arguments that default to true:
def GC.start(full_mark: true, immediate_sweep: true); end
Use full_mark: false to perform a minor GC. Use immediate_sweep: false to
defer sweeping (use lazy sweep).
Note: These keyword arguments are implementation and version dependent. They
are not guaranteed to be future-compatible, and may be ignored if the
underlying implementation does not support them.
Ruby Manages Garbage Collection Automatically
For the most part, Ruby handles garbage collection automatically. There are some edge cases, of course, but in the general case you should never have to worry about garbage collection in a typical Ruby application.
Implementation details of garbage collection vary between versions of Ruby, but it exposes very few knobs to twiddle and for most purposes you don't need them. If you find yourself under memory pressure, you may want to re-evaluate your design decisions rather than trying to manage the symptom of excess memory consumption.
Manually Trigger Garbage Collection
In general terms, Ruby marks objects for garbage collection when they go out of scope or are no longer referenced. However, some objects such as Symbols never get collected, and persist for the entire run-time of your program.
You can manually trigger garbage collection with GC.start, but you can't really free blocks of memory from within Ruby the way you can in C programs. If you find yourself needing to do this, you may want to solve the underlying X/Y problem rather than trying to manage memory directly.
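As a small sketch of that advice (the variable names are made up), WeakRef from the standard library can show whether the object was actually reclaimed once the strong reference is dropped; note that collection timing is still entirely up to the GC:
require 'weakref'

object = Object.new
ref = WeakRef.new(object)   # weak reference: does not keep the object alive

object = nil                # drop the only strong reference
GC.start                    # request a collection; the VM may still delay it

ref.weakref_alive?          # => false once the object has really been collected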
You can't explicitly destroy an object. Ruby has automatic memory management. Objects no longer referenced from anywhere are automatically collected by the garbage collector built into the interpreter.
A good article to read on how to allocate wisely, with a few tools you can use to fine-tune:
http://merbist.com/2010/07/29/object-allocation-why-you-should-care/

What is the design rationale behind HandleScope?

V8 requires a HandleScope to be declared in order to clean up any Local handles that were created within that scope. I understand that HandleScope will dereference these handles for garbage collection, but I'm interested in why each Local class doesn't do the dereferencing itself, like most internal ref_ptr-type helpers do.
My thought is that HandleScope can do it more efficiently by dumping a large number of handles all at once rather than one by one as they would in a ref_ptr type scoped class.
Here is how I understand the documentation and the handles-inl.h source code. I, too, might be completely wrong since I'm not a V8 developer and documentation is scarce.
The garbage collector will, at times, move stuff from one memory location to another and, during one such sweep, also check which objects are still reachable and which are not. In contrast to reference-counting types like std::shared_ptr, this is able to detect and collect cyclic data structures. For all of this to work, V8 has to have a good idea about what objects are reachable.
On the other hand, objects are created and deleted quite a lot during the internals of some computation. You don't want too much overhead for each such operation. The way to achieve this is by creating a stack of handles. Each object listed in that stack is available from some handle in some C++ computation. In addition to this, there are persistent handles, which presumably take more work to set up and which can survive beyond C++ computations.
Having a stack of references requires that you use it in a stack-like way. There is no "invalid" mark in that stack: all the objects from the bottom to the top of the stack are valid object references. The way to ensure this is the HandleScope. It keeps things hierarchical. With reference-counted pointers you can do something like this:
#include <memory>
using std::shared_ptr;
struct Object { Object(int) {} };

shared_ptr<Object>* f() {
    shared_ptr<Object> a(new Object(1));                            // object 1: destroyed when f returns
    shared_ptr<Object>* b = new shared_ptr<Object>(new Object(2));  // object 2: kept alive on the heap
    return b;
}
void g() {
    shared_ptr<Object>* b = f();   // object 1 is already gone here, object 2 is still valid
    delete b;                      // only now is object 2 destroyed
}
Here object 1 is created first, then object 2; when f returns, object 1 is destroyed, and object 2 is destroyed only later, in g. The key point is that there is a point in time when object 1 is invalid but the later-created object 2 is still valid, so lifetimes are not stack-ordered. That's what HandleScope aims to avoid.
Some other GC implementations examine the C stack and look for pointers they find there. This has a good chance of false positives, since stuff which is in fact data could be misinterpreted as a pointer. For reachability this might seem rather harmless, but when rewriting pointers since you're moving objects, this can be fatal. It has a number of other drawbacks, and relies a lot on how the low level implementation of the language actually works. V8 avoids that by keeping the handle stack separate from the function call stack, while at the same time ensuring that they are sufficiently aligned to guarantee the mentioned hierarchy requirements.
To offer yet another comparison: an object referenced by just one shared_ptr becomes collectible (and actually will be collected) once its C++ block scope ends. An object referenced by a v8::Handle will become collectible when leaving the nearest enclosing scope which did contain a HandleScope object. So programmers have more control over the granularity of stack operations. In a tight loop where performance is important, it might be useful to maintain just a single HandleScope for the whole computation, so that you won't have to access the handle stack data structure so often. On the other hand, doing so will keep all the objects around for the whole duration of the computation, which would be very bad indeed if this were a loop iterating over many values, since all of them would be kept around till the end. But the programmer has full control, and can arrange things in the most appropriate way.
Personally, I'd make sure to construct a HandleScope:
At the beginning of every function which might be called from outside your code. This ensures that your code will clean up after itself.
In the body of every loop which might see more than three or so iterations, so that you only keep variables from the current iteration.
Around every block of code which is followed by some callback invocation, since this ensures that your stuff can get cleaned if the callback requires more memory.
Whenever I feel that something might produce considerable amounts of intermediate data which should get cleaned (or at least become collectible) as soon as possible.
In general I'd not create a HandleScope for every internal function if I can be sure that every other function calling this will already have set up a HandleScope. But that's probably a matter of taste.
Disclaimer: This may not be an official answer, more of a conjecture on my part; but the V8 documentation is hardly useful on this topic, so I may be proven wrong.
From my understanding, gained while developing various V8-based backend applications, it's a means of handling the differences between the C++ and JavaScript environments.
Imagine the following sequence, in which a self-dereferencing handle could break the system:
JavaScript calls a C++-wrapped V8 function: let's say helloWorld()
The C++ function creates a v8::Handle with the value "hello world =x"
C++ returns the value to the V8 virtual machine
The C++ function does its usual cleanup of resources, including dereferencing of handles
Another C++ function / process overwrites the freed memory space
V8 reads the handle: the data is no longer the same, "hell!#(#..."
And that's just the surface of the complicated inconsistency between the two. Hence, to tackle the various issues of connecting the JavaScript VM (virtual machine) to the C++ interfacing code, I believe the development team decided to simplify the issue via the following...
All variable handles are to be stored in "buckets", aka HandleScopes, to be built / compiled / run / destroyed by their respective C++ code when needed.
Additionally, all function handles are to refer only to C++ static functions (I know this is irritating), which ensures the "existence" of the function call regardless of constructors / destructors.
Think of it from a development point of view: it marks a very strong distinction between the JavaScript VM development team and the C++ integration team (the Chrome dev team?), allowing both sides to work without interfering with one another.
Lastly, it could also be for the sake of simplicity in emulating multiple VMs: V8 was originally meant for Google Chrome, so a simple HandleScope creation and destruction whenever we open / close a tab makes for much easier GC management, especially in cases where many VMs are running (one per tab in Chrome).
