What is the meaning of :FREE in ObjectSpace.count_objects on Ruby MRI?

I would like to know what the count associated with the key :FREE returned from ObjectSpace.count_objects means. The documentation says this hash is implementation-specific, so my question refers specifically to MRI Ruby 2.1.
There have been at least two questions about this (here and here), but no answers regarding :FREE.
Any ideas?
In some cases, the free count is much higher than the number of objects accessible through ObjectSpace.each_object, and thus I don't seem to have any information about them. Are they taking up memory? In my program the :FREE count is high even after running garbage collection.

We can find the meaning of :FREE straight from the implementation itself (from gc.c):
* The keys starting with +:T_+ means live objects.
* For example, +:T_ARRAY+ is the number of arrays.
* +:FREE+ means object slots which is not used now.
* +:TOTAL+ means sum of above.
Then we can take a look at the tests for it (from test_gc.rb):
assert_equal(count[:TOTAL]-count[:FREE], stat[:heap_live_slots])
assert_equal(count[:FREE], stat[:heap_free_slots])
And finally, we can double-check that there's no funny business going on:
irb(main):001:0> GC.stat[:heap_free_slot] == ObjectSpace.count_objects[:FREE]
=> true
So, :FREE indicates the number of allocated heap slots that are not currently in use.
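To see this in action, here's a quick sketch on a recent MRI (the GC.stat key is :heap_free_slot on 2.1 and :heap_free_slots on 2.2+; the equality check can come out false if objects are allocated between the two calls):

counts = ObjectSpace.count_objects
p counts[:TOTAL] - counts[:FREE]              # live slots, mirroring the test above
p counts[:FREE] == GC.stat[:heap_free_slots]  # the GC's own bookkeeping agrees

# Running the GC typically *increases* :FREE: collected objects vacate their
# slots, but the slots stay allocated on Ruby's heap, ready for reuse.
GC.start
p ObjectSpace.count_objects[:FREE]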

Related

Ruby, immutable integers and unused objects

a = 1
a += 1
=> 2
The original object 1 is now unused, and this is not very performant. Why are integers immutable in Ruby? I've looked on Stack Overflow but could find no explanation.
Integers, and all Numerics, are immutable, and for Integers there is only ever one instance of each value. We can see this by checking their #object_id.
2.6.4 :001 > a = 1
=> 1
2.6.4 :002 > a.object_id
=> 3
2.6.4 :003 > 1.object_id
=> 3
2.6.4 :004 > b = 1
=> 1
2.6.4 :005 > b.object_id
=> 3
This behavior is also documented in Numeric.
Other core numeric classes [other than Numeric] such as Integer are implemented as immediates, which means that each Integer is a single immutable object which is always passed by value. There can only ever be one instance of the integer 1, for example. Ruby ensures this by preventing instantiation. If duplication is attempted, the same instance is returned.
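A quick irb check on the same 2.6.4 session shows both properties (dup returning the receiver for immediates is the behavior the docs describe; older MRIs raised a TypeError instead):

1.frozen?        #=> true: the instance cannot be mutated
x = 1
x.dup.equal?(x)  #=> true: "duplication" hands back the very same instance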
Having a single immutable object for each Integer saves memory in normal use: you're going to use 1 a lot, and everyone sharing the same 1 saves a lot of memory. But that sharing requires immutability, otherwise adding to one object would change every other reference to it, and that would be bad.
Not having to constantly deallocate and allocate the same Integer over and over is faster and reduces memory fragmentation. Integers that really are unused (the large ones that aren't immediates) will be garbage collected; Ruby's garbage collection is constantly being improved and can be tweaked with many environment variables.
It also simplifies their implementation and makes them more performant. Immutable objects can cache any calculations, confident the cache will never need to be invalidated.
Why are integers immutable in ruby? I've looked on stackoverflow but could find no explanation.
Let's assume that Integers were mutable, so that << shifted the receiver in place:
1 << 1  # the object 1 itself now holds the value 2
1 + 1   #=> 4, because both operands are that same mutated object
I am guessing that the reason that you find no explanation is that most assume it is self-evident that the ability to change the value of 1 to be anything other than 1 would lead to an absolutely insane maintenance nightmare.
Think about what the ability to mutate integers would entail for the Ruby programmer: it would give them too much power to change how the language observes the world. An integer object is a sequence of bits in memory, and if you increment that object by one, is it still one? No. That's why Ruby uses variables to point to a different object rather than mutating the one they currently reference.
Immutable integers save on memory consumption in the long run. I think what you are questioning here is Ruby's decision to keep the original object around rather than mutate it, on the grounds that this increases memory consumption. But a is not a global variable, and objects it no longer references will eventually be cleared by garbage collection. It also makes sense for Ruby to reuse objects rather than create the same ones over and over. If your program holds more objects than the currently allocated memory can fit, Ruby must ask the OS for more, which is expensive, and your code should be cognisant of that; Ruby also frees memory back when too much is allocated. Ruby objects live in Ruby's own object heap rather than the malloc heap, which the garbage collector sweeps often.
If integers were mutable then modifying a variable that is used elsewhere in an application would cause hard-to-fix regressions and would cause a lot of issues in web applications. As for the concern of retention and performance, look at the following example:
require 'newrelic_rpm'

a = 0
loop do
  a += 1
  if (a % 10_000).zero?
    p a
    # Sample the process's resident memory via the NewRelic agent.
    p NewRelic::Agent::Samplers::MemorySampler.new.sampler.get_sample
  end
end
Observe how many objects are kept and how much memory the process consumes. Now think of how many objects are going to be kept in a real-world program, and whether memory consumption will really be an issue. If you do find yourself running into it (which I doubt), treat it as a signal to improve your code.
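If you would rather not pull in the newrelic_rpm gem, here is a rough equivalent using only the built-in GC statistics (a sketch; :heap_live_slots exists on modern MRI, but GC.stat key names vary across versions):

a = 0
loop do
  a += 1
  if (a % 100_000).zero?
    p a
    # Live heap slots stay essentially flat: small Integers are immediates,
    # so incrementing a allocates no heap objects at all.
    p GC.stat[:heap_live_slots]
  end
end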

Why can't ruby use most of the 2^X numbers as object ids?

ObjectSpace._id2ref gives us the object from Ruby's object space for a given ID. The IDs appear to be assigned in sequence starting from 0. However, if we try to look up the object with ID 4, we get an error:
2.6.3 :121 > ObjectSpace._id2ref(4)
Traceback (most recent call last):
2: from (irb):121
1: from (irb):121:in `_id2ref'
RangeError (0x0000000000000004 is not id value)
Also, I figured that it's the same behaviour for 2^x values (except 1, 2, and 8).
(0..10).each do |exp|
  object_id = 2**exp
  begin
    puts "Number: #{object_id} : #{ObjectSpace._id2ref(object_id)}"
  rescue Exception => e
    puts "Number: #{object_id} : #{e.message}"
  end
end
Number: 1 : 0
Number: 2 : 2.0
Number: 4 : 0x0000000000000004 is not id value
Number: 8 : nil
Number: 16 : 0x0000000000000010 is not id value
Number: 32 : 0x0000000000000020 is not id value
Number: 64 : 0x0000000000000040 is not id value
Number: 128 : 0x0000000000000080 is not symbol id value
Number: 256 : 0x0000000000000100 is not id value
Number: 512 : 0x0000000000000200 is not id value
Number: 1024 : 0x0000000000000400 is not id value
Why can't ruby use these specific numbers as object ids?
Also, what's different about 1, 2, and 8? And why is the error different for 128?
First, it is very important to make a couple of things crystal clear:
There are exactly two guarantees Ruby makes about object IDs. These two guarantees are the only thing you are allowed to rely on. You must not make any assumptions about object IDs other than these two guarantees:
An object has the same ID for its entire lifetime.
No two objects have the same ID at the same time.
[Note: this means in particular that different objects can have the same ID at different times, i.e. that IDs can be recycled.]
An object ID is an opaque identifier. You must not make any assumptions about its structure or about any particular value.
Any particular implementation of object IDs is a private internal implementation detail of a specific version of a specific implementation running in a specific environment at a specific moment. There is no guarantee that the results will be the same with a different implementation. There is no guarantee that the results will be the same with a different version of the same implementation. There is no guarantee that the results will be the same with the same version of the same implementation running in a different environment. In fact, there is not even a guarantee that the results will be the same between two runs of the same code on the same version of the same implementation in the same environment.
ObjectSpace::_id2ref is an abomination. It should not even exist. It most certainly should not be used. It breaks object-orientation, it breaks encapsulation, it breaks safety.
Just as an example: unfortunately, you don't say which version of which implementation you are running in which environment. However, it looks like you are running YARV 2.6.3 in a 64-bit environment.
If you were to run that exact same code on the exact same version of YARV in a 32-bit environment, you would get different results. If you were to run that exact same code on an older version of YARV (pre-2.0) in the exact same environment, you would get different results.
Let's address the first, implicit, assumption which I think I see in your question. You seem to think that any ID should resolve to an object. It's easy to see that this cannot be true: there are infinitely many IDs, but for every run of a program, there are only finitely many objects, so there will always be infinitely many IDs which don't resolve to an object.
This already explains most of your results, namely the ones for 4, 16, 32, 64, 256, 512, and 1024.
So, with that out of the way, here's a high-level explanation of why there seems to be some sort of structure to the IDs, and what that structure is. (But let me remind you again that this explanation only applies to 64-bit systems, not to 32-bit; it only applies to YARV; it only applies to YARV 2.0 or newer; and it is quite possible that it will no longer apply to YARV 3.0.)
In YARV, the developers made the decision that the object ID is the same thing as the memory address of the object header. This makes it easy to ensure the "rules" of object IDs: you can't have multiple objects at the same memory address at the same time, and an object will not change its memory address.
(Actually, it turns out that the second one is already a quite severe restriction: many modern high-performance garbage collectors depend on being able to move objects around in memory. This is not possible if you assume that object ID == memory address. Which means you will not be able to use any of those high-performance algorithms.)
On pretty much all modern machines, memory access is word-aligned. While it is possible to address individual bytes, that is generally slower or more awkward. So, we can basically assume that any memory we allocate will be aligned on a word boundary. This means that all memory addresses will be divisible by 8 on 64-bit systems and by 4 on 32-bit systems, or in other words, that all memory addresses will end in 3 (64-bit) or 2 (32-bit) zero bits. Or, put another way: 87.5% (75%) of the address space is unused.
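To make the alignment concrete (the address below is made up, purely for illustration):

address = 0x7f5b_8c02_88a8  # a hypothetical, word-aligned 64-bit heap address
address % 8      #=> 0: divisible by the word size
address & 0b111  #=> 0: the three low bits are always zero, free to use as tags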
On the other hand, it would be quite a waste to represent Integers as a full-blown Ruby object:
They are immutable, which means we don't have to store any state.
They can't have instance variables, which means we don't have to store an instance variable table.
They can't have a singleton class, which means we don't have to store a __klass__ pointer.
They can't be extended.
And so on …
What this means is that we can optimize the representation of Integers by not storing them as objects at all. All we need is some special case in the engine, so that if someone asks for the class of, say, 42, instead of trying to look at 42's __klass__ pointer, the engine "magically" knows to just return the Integer class.
Once we have that in place, we can do a really cool trick, which is actually as old as the very first LISP and Smalltalk VMs, and it is called a tagged pointer representation. Normally, the value of a variable is a pointer to the object (header), but with a tagged pointer representation we can store the value of the object inside the pointer to the object itself!
All we need to do is to have some sort of tag on the pointer that tells the engine that this is actually not a pointer but a value disguised as a pointer. In some older machines, especially those specifically designed for running high-level languages, pointers did have a tag field specifically for holding, e.g. type information or access control. Modern machines don't have that, but we have those unused bits we can (ab)use as tag bits.
And that is what YARV is doing: when the last bit of a pointer is 1, then it's not actually a pointer, it's an Integer. In particular, an Integer is encoded in YARV by shifting it one bit to the left and setting the last bit to 1. This allows us to encode a 63-bit Integer in a 64-bit pointer, and to do native integer arithmetic on it with no object overhead and only a little bit of bit-shifting overhead.
And if you think about what this encoding means:
shifting one bit to the left is equivalent to multiplying by two
setting the last bit to 1 is equivalent to incrementing by 1
Then you can explain the first pattern: a small Integer with value n is encoded as the "quasi-pointer" 2n + 1, and since "memory address" and object ID are the same in YARV (even though this is not actually a memory address, because there is no object which could have an address), it will have the object ID 2n + 1.
Integers that don't fit into 63 bits (31 bits) are allocated as objects like any other object. In different engines, these have different names, e.g. in the Smalltalk-80 VM, they are called SmallInts; in YARV, they are called Fixnums (and the ones that don't fit into a Fixnum are called Bignums). They actually used to be different subclasses of a fully-abstract Integer class in older versions of YARV, but this was considered a mistake. (It's really an internal optimization and should not be visible to the programmer.) In current versions of YARV, Fixnum and Bignum are aliases for Integer, and using them gives a deprecation warning.
This explains your result for 1. If you had tried ObjectSpace._id2ref(3), the result would have been 1, ObjectSpace._id2ref(5) would have been 2, and so on.
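This is easy to verify on the setup from the question, YARV 2.6.3 on 64-bit (and, to repeat, it is exactly the kind of detail you must never rely on):

(0..4).map(&:object_id)  #=> [1, 3, 5, 7, 9]: the Integer n has ID 2n + 1
ObjectSpace._id2ref(3)   #=> 1
ObjectSpace._id2ref(7)   #=> 3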
And we still are using only 62.5% of the address space (on a 64-bit system)!
So, let's think about what else we might want to represent in this way.
YARV has a very similar optimization for Floats. Floating point numbers that fit into 62 bits are called flonums and are represented similarly, with a tag of 10 at the end. (YARV does not use flonums on 32-bit platforms.)
This explains your result for ObjectSpace._id2ref(2). If you had tried ObjectSpace._id2ref(6), the result would have been -2.0.
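Again verifiable on the same setup:

2.0.object_id           #=> 2: a flonum, carrying the 10 tag in its low bits
ObjectSpace._id2ref(6)  #=> -2.0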
And a similar trick is also played for Symbols. I won't explain it here in detail, because a) I don't actually fully know how it works, and b) it is slightly more complex, because the value being encoded isn't directly the Symbol value, rather it is an index into the Symbol table. However, that explains your result for 128.
Now, lastly, there is a completely different part of the address space that is also unused: the low addresses. Operating systems deliberately keep the lowest part of the address space unmapped, and one important reason is this: in C, it is illegal to dereference a NULL pointer. One way of implementing that rule would be for the runtime to track all pointer dereferences and check whether they are dereferencing the NULL pointer. But there is an easier way: just give the NULL pointer an actual memory address, but one that is never allocated. That way, you don't have to do anything: if the code tries to dereference the pointer, the address doesn't exist, and the MMU will take care of raising an error. So, most C compilers compile the NULL pointer to the actual memory address 0, and in order to make sure that there is never any real data allocated at that address, they keep a whole area around address 0 free.
This means that the low addresses are never used, and we can (ab)use them to represent even more "interesting" objects. Now, YARV uses the very low addresses to represent the following objects:
false at address 0, which has the additional advantage that 0 is considered false in C.
nil at address 8 (4 in 32-bit).
true at address 20 (2 in 32-bit).
Qundef (a special internal value inside the engine that denotes an undefined value) at address 52 (6 in 32-bit).
And that explains your number 8.
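The same session confirms all of these special values (again, pre-2.7 YARV on 64-bit):

false.object_id         #=> 0
nil.object_id           #=> 8
true.object_id          #=> 20
ObjectSpace._id2ref(8)  #=> nil, matching the list above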
This also means that your 4, 16, 32, 64, 256, 512, and 1024 will probably never resolve to an object, because they are in the low address range where the C library will simply never allocate memory.
As a closing remark, I want to repeat one last time that all of this is a private internal implementation detail of a specific version of YARV running in a specific environment. You must not rely on any of this, ever.
When flonums were introduced in YARV, nil no longer had object ID 4 on the affected platforms. This did break some code, and it did cause some confusion (as evidenced e.g. by questions on Stack Overflow), even though the YARV developers are allowed to change object IDs at will: no guarantees are made about any particular ID values or about the structure of IDs. Please, do not make the same mistake.

Fastest data structure with default values for undefined indexes?

I'm trying to create a 2d array where accessing an index returns the value stored there. However, if an undefined index is accessed, it should call a callback, fill the index with the callback's result, and then return that value.
The array will have negative indexes, too, but I can overcome that by using 4 arrays (one for each quadrant around 0,0).
You can create a Matrix class that relies on tuples and a dictionary, with the following behavior:
from collections import namedtuple

# Keys are (x, y) coordinates; the value lives in the dictionary itself.
MatrixEntry = namedtuple("MatrixEntry", ["x", "y"])

matrix = {}
default_value = 0

# add entry at (0, 1)
matrix[MatrixEntry(0, 1)] = 10.0

# get value at (0, 1), falling back to the default for missing keys
key = MatrixEntry(0, 1)
value = matrix.get(key, default_value)
Cheers
This question is probably too broad for Stack Overflow: there is no generic "one size fits all" solution, and the results depend a lot on the language used (and its standard library).
There are several problems bundled into this question. First of all, let us consider the 2d array itself: say it is simply already part of the language, and that such an array grows dynamically on access. If that isn't the case, the question becomes really language-dependent.
Now, when allocating memory, the language often automatically initializes the slots (again, how this happens and what the best method is are language-dependent; look into RAII). However, I can foresee that the actual calculation of a specific cell might be costly compared to the allocation. In that case, an interesting technique is so-called "two-phase construction": the array is filled with tuples/objects whose default construction sets a bit/boolean to false, indicating that the value is not ready. Then, on access (via a get() method or an operator(), language-dependent), if this bit is false the value is constructed first; otherwise it is just read.
Another method is to use a dictionary/key-value map, where the key is the coordinate pair and the value is the cell value. This has the advantage that construct-on-access is inherent to the data structure (though, again, this is language-dependent). The drawback of using maps, however, is that the lookup speed changes from O(1) to O(log n) for tree-based maps (the actual time differs widely between languages).
Finally, I hope you understand that how to do this depends on your more specific requirements, the language you use, and the available libraries. In the end, the only data structure present in every language is a long sequence of uninitialized memory; anything more advanced than that depends on the language.
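Since this is a Ruby thread, here is what construct-on-access looks like in Ruby specifically: a minimal sketch using Hash.new with a default block (expensive_compute is a hypothetical stand-in for the asker's callback). Array keys make negative coordinates work without the four-quadrant trick:

# Hypothetical callback; stands in for whatever fills an undefined index.
def expensive_compute(x, y)
  x * y
end

# Hash.new's block runs once per missing key; writing the result back into
# the hash memoizes it, so later reads never invoke the callback again.
grid = Hash.new { |h, (x, y)| h[[x, y]] = expensive_compute(x, y) }

grid[[3, -2]]  #=> -6, computed and cached on first access
grid[[3, -2]]  #=> -6, served straight from the hash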

If ruby encourages duck typing so much, why don't we have Hash.count instead of Hash.length?

This is something that really confuses me, it seems like time and time again I run into methods in ruby native data types that do the same thing (essentially), and yet have different names. If duck typing is so strongly encouraged by ruby and the ruby community, why aren't these methods named consistently across types?
You seem to imply that Hash does not have a length method and/or that other enumerables don't have a count method. That is not true.
count is a method defined in the Enumerable module and thus available on all enumerables. It differs from size and length in the following ways:
It (optionally) takes a block specifying which kind of elements to count.
It's available on all enumerables, not just those that keep track of their size; however, it runs in O(n) for those that don't (and always when given a block, of course).
length and size (which are synonyms) are methods defined on all enumerable classes that keep track of their size (including Hash). They differ from count in that they always return the length in O(1) time and don't take a block.
In summary: You can call length or size on any object that keeps track of its size and you can call count on any enumerable. So duck typing is not hampered in any way.
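A quick illustration with a Hash:

h = { a: 1, b: 2, c: 3 }
h.length                   #=> 3 (O(1): Hash tracks its own size)
h.size                     #=> 3 (synonym for length)
h.count                    #=> 3 (from Enumerable)
h.count { |_k, v| v > 1 }  #=> 2 (count with a block is always O(n))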

In Ruby, can I make a reference to an array offset?

1. In Ruby, can I do something C-like, like this (with my made-up operator &)?
a = [1, 2, 3, 4]
b = &a[2]    # b => [3, 4]
b[0] = 99    # now a => [1, 2, 99, 4]
2. If the elements of an array are integers, does Ruby necessarily store them consecutively in a contiguous part of memory? I'm guessing "no": only addresses are stored, integers being objects, like everything else in Ruby.
3. If the answer to #2 is "yes" (which I doubt), is there a way to efficiently shift blocks of memory, as one can do in C, for example?
There is no such functionality built into Ruby (Ruby arrays are not built of cons cells, and taking the address is much lower level than Ruby operates), though honestly it would not be hard to write something like that.
To answer the second question: it wouldn't necessarily be a contiguous array of raw integers. MRI treats small integers as immediate values (with the least significant bit as a flag indicating whether a word represents an integer or an object address), so an array's slots hold those tagged values directly rather than addresses. Other implementations do it their own way.
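Writing the "view" the first answer alludes to is indeed not hard. Here is a minimal sketch (ArrayView is a hypothetical name, not a library class): it forwards reads and writes to a slice of the backing array, so mutations through the view are visible in the original.

class ArrayView
  def initialize(backing, offset)
    @backing = backing
    @offset  = offset
  end

  # Read through to the backing array, shifted by the view's offset.
  def [](index)
    @backing[@offset + index]
  end

  # Write through as well, so the original array sees the change.
  def []=(index, value)
    @backing[@offset + index] = value
  end

  def to_a
    @backing[@offset..-1]
  end
end

a = [1, 2, 3, 4]
b = ArrayView.new(a, 2)
b.to_a    #=> [3, 4]
b[0] = 99
a         #=> [1, 2, 99, 4]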
