Ruby, immutable integers and unused objects

a = 1
a += 1
=> 2
The original object 1 is now unused, and this is not very performant. Why are integers immutable in Ruby? I've looked on Stack Overflow but couldn't find an explanation.

Integers, and all Numerics, are immutable, so for an Integer there is only ever one instance of each value. We can see this by checking their #object_id.
2.6.4 :001 > a = 1
=> 1
2.6.4 :002 > a.object_id
=> 3
2.6.4 :003 > 1.object_id
=> 3
2.6.4 :004 > b = 1
=> 1
2.6.4 :005 > b.object_id
=> 3
This behavior is also documented in Numeric.
Other core numeric classes [other than Numeric] such as Integer are implemented as immediates, which means that each Integer is a single immutable object which is always passed by value. There can only ever be one instance of the integer 1, for example. Ruby ensures this by preventing instantiation. If duplication is attempted, the same instance is returned.
Having a single immutable object for each Integer saves memory: in normal use you're going to use 1 a lot, and everyone sharing the same 1 saves a lot of memory. But that means it must be immutable, otherwise adding to the object in one place would change it everywhere else, and that would be bad.
Not having to constantly deallocate and allocate the same Integer over and over is faster and reduces memory fragmentation. If they're really unused, Ruby will garbage collect them. Ruby's garbage collection is constantly being improved and can be tweaked with many environment variables.
It also simplifies their implementation and makes them more performant. Immutable objects can cache any calculations confident the cache will never need to be invalidated.
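You can also observe the immutability directly. A quick check (output as seen on Ruby 2.6; the exact error class varies between versions):
1.frozen?                          #=> true
1.dup.equal?(1)                    #=> true, "duplication" just hands back the same instance
begin
  1.instance_variable_set(:@x, 0)  # trying to mutate the one and only 1
rescue => e
  e.class                          #=> FrozenError (RuntimeError on Rubies before 2.5)
end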

Why are integers immutable in ruby? I've looked on stackoverflow but could find no explanation.
Let's assume that Integers were mutable:
1 << 1    # suppose << mutated the object 1 in place, turning its value into 2
1 + 1
#=> 4     # every other use of "1" in the program would now really be 2
I am guessing that the reason that you find no explanation is that most assume it is self-evident that the ability to change the value of 1 to be anything other than 1 would lead to an absolutely insane maintenance nightmare.

Think about what the ability to mutate integers would mean for the Ruby programmer: it would give them far too much power over how the language sees the world. An Integer object stands for a fixed binary value in memory, and if you could increment the object 1 itself, it would no longer be one. That's why Ruby rebinds the variable to point at a different object rather than mutating the object.
Immutable integers save memory in the long run. I think what you are really questioning is Ruby's decision to keep the original object around rather than mutate it, which looks like it increases memory consumption. But a is not a global variable, so the old object becomes eligible for garbage collection once nothing references it. Ruby takes a don't-repeat-yourself approach to object allocation: it makes sense to retain and reuse objects rather than create the same ones over and over. If you have more objects than fit in the currently allocated heap, Ruby has to ask the OS for more memory, which is a relatively expensive operation, and your code should be cognisant of that; Ruby also releases memory when too much is allocated. Ruby objects live in Ruby's own object heap rather than the malloc heap, and the garbage collector sweeps it often.
If integers were mutable, then modifying a value that is used elsewhere in an application would cause hard-to-fix regressions and a lot of issues in web applications. As for the concern about retention and performance, look at the following example:
require 'newrelic_rpm'   # requires the newrelic_rpm gem to be installed and configured

a = 0
loop do
  a += 1
  if a % 10_000 == 0
    p a
    p NewRelic::Agent::Samplers::MemorySampler.new.sampler.get_sample
  end
end
Observe how many objects are 'kept' and how much memory is consumed by the process. Now think of how many objects are going to be kept in a real-world program and whether the memory consumption will be a real issue. If you do find yourself running into this issue, which I doubt you will, take it as a signal to improve your code.

Related

Is there a way to get the Ruby runtime to combine frozen identical objects into a single instance?

I have data in memory, especially strings, that have large numbers of duplicates. We're hitting the ceiling with memory sometimes and are trying to reduce our footprint. I thought that if I froze the strings, then the Ruby runtime would combine them into single objects in memory. So I thought that this code would return a lower number, ideally, 1, but it did not:
a = Array.new(1000) { 'foo'.dup.freeze } # create separate objects, but freeze them
sleep 5 # give the runtime some time to combine the objects
a.map(&:object_id).uniq.size # => 1000
I guess this makes sense, because if there was a reference to the duplicated object (e.g. object id #202), and all of the frozen strings are combined to use #200, then dereferencing #202 will fail. So maybe my question doesn't make sense.
I guess the best strategy for me to save memory might be to convert the strings to symbols. I am aware that they will never be garbage collected, but there would be a small enough number of them that this would not be a problem. Is there a better way?
You basically have the right idea, but in my opinion you found a big gotcha in Ruby. You are correct that Ruby can dedup frozen strings to save memory, but in general frozen ≠ deduped!!!
tl;dr: the reason is that the two operations have different semantics. Always use String#-@ if you want a string deduped.
Recall that freeze is a method of Object, so it has to work with every class. In English, freeze is "make it so no further changes can be made to this object and also return the same object so that I can keep calling methods on it". In particular, it would be odd if x.freeze != x. Imagine if I had two arrays that I was modifying, then decided to freeze them. Would it make sense for the interpreter to then iterate through both arrays to see if their contents are equal and to decide to completely throw away one of them? That could be very expensive. So in general freeze does not promise this behavior and always returns the same object, just frozen.
Deduping works very differently because when you call -myStr you're actually saying "return the unique frozen version of this string in memory". In most cases the whole point is to get a different object than the one in myStr (so that the GC can clean up that string and only keep the frozen one).
Unfortunately, the distinction is muddled since if you call freeze on a string literal, Ruby will dedup it automatically! This is sensible because there's no way to get a reference to the original literal object; the fact that the interpreter is allowing x.freeze != x doesn't matter, so we might as well save some memory. But it might also give the impression that freeze does guarantee deduping, when in fact it does not.
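To make the distinction concrete, here is a small check (behaviour as of Ruby 2.5 or later, where String#-@ dedups unfrozen strings):
a = 'foo'.dup
b = 'foo'.dup
a.freeze.equal?(b.freeze)          #=> false, freeze does not dedup
c = 'foo'.dup
d = 'foo'.dup
(-c).equal?(-d)                    #=> true, String#-@ returns the one interned copy
'bar'.freeze.equal?('bar'.freeze)  #=> true, freezing a literal directly does dedup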
This gotcha was discussed when string deduping was first introduced, so it is definitely an intentional design decision by the Ruby developers.

Why can't ruby use most of the 2^X numbers as object ids?

ObjectSpace._id2ref gives us the object from Ruby's ObjectSpace for a given id; ids appear to be assigned in sequence starting from 0. However, if we try to look up the object with id 4, it raises an error:
2.6.3 :121 > ObjectSpace._id2ref(4)
Traceback (most recent call last):
2: from (irb):121
1: from (irb):121:in `_id2ref'
RangeError (0x0000000000000004 is not id value)
Also, I figured that it's the same behaviour for 2^x values (except 1, 2, and 8).
(0..10).each do |exp|
  object_id = 2**exp
  begin
    puts "Number: #{object_id} : #{ObjectSpace._id2ref(object_id)}"
  rescue Exception => e
    puts "Number: #{object_id} : #{e.message}"
  end
end
Number: 1 : 0
Number: 2 : 2.0
Number: 4 : 0x0000000000000004 is not id value
Number: 8 : nil
Number: 16 : 0x0000000000000010 is not id value
Number: 32 : 0x0000000000000020 is not id value
Number: 64 : 0x0000000000000040 is not id value
Number: 128 : 0x0000000000000080 is not symbol id value
Number: 256 : 0x0000000000000100 is not id value
Number: 512 : 0x0000000000000200 is not id value
Number: 1024 : 0x0000000000000400 is not id value
Why can't ruby use these specific numbers as object ids?
Also, what's different about 1, 2, and 8? And why is the error different for 128?
First, it is very important to make a couple of things crystal clear:
There are exactly two guarantees Ruby makes about object IDs. These two guarantees are the only thing you are allowed to rely on. You must not make any assumptions about object IDs other than these two guarantees:
An object has the same ID for its entire lifetime.
No two objects have the same ID at the same time.
[Note: this means in particular that different objects can have the same ID at different times, i.e. that IDs can be recycled.]
An object ID is an opaque identifier. You must not make any assumptions about its structure or about any particular value.
Any particular implementation of object IDs is a private internal implementation detail of a specific version of a specific implementation running in a specific environment at a specific moment. There is no guarantee that the results will be the same with a different implementation. There is no guarantee that the results will be the same with a different version of the same implementation. There is no guarantee that the results will be the same with the same version of the same implementation running in a different environment. In fact, there is not even a guarantee that the results will be the same between two runs of the same code on the same version of the same implementation in the same environment.
ObjectSpace::_id2ref is an abomination. It should not even exist. It most certainly should not be used. It breaks object-orientation, it breaks encapsulation, it breaks safety.
Just as an example: unfortunately, you don't say which version of which implementation you are running in which environment. However, it looks like you are running YARV 2.6.3 in a 64-bit environment.
If you were to run that exact same code on the exact same version of YARV in a 32-bit environment, you would get different results. If you were to run that exact same code on an older version of YARV (pre-2.0) in the exact same environment, you would get different results.
Let's address the first, implicit, assumption which I think I see in your question. You seem to think that any ID should resolve to an object. It's easy to see that this cannot be true: there are infinitely many IDs, but for every run of a program, there are only finitely many objects, so there will always be infinitely many IDs which don't resolve to an object.
This already explains most of your results, namely the ones for 4, 16, 32, 64, 256, 512, and 1024.
So, with that out of the way, here's a high-level explanation of why there seems to be some sort of structure to the IDs, and what that structure is. (But let me remind you again that this explanation only applies to 64-bit systems, not to 32-bit, it only applies to YARV, it only applies to versions of YARV 2.0 or newer, and it is quite possible that it will no longer apply to YARV 3.0.)
In YARV, the developers made the decision that the object ID is the same thing as the memory address of the object header. This makes it easy to ensure the "rules" of object IDs: you can't have multiple objects at the same memory address at the same time, and an object will not change its memory address.
(Actually, it turns out that the second one is already a quite severe restriction: many modern high-performance garbage collectors depend on being able to move objects around in memory. This is not possible if you assume that object ID == memory address. Which means you will not be able to use any of those high-performance algorithms.)
On pretty much all modern machines, memory access is word-aligned. While it is possible to address individual bytes, that is generally slower or more awkward. So, we can basically assume that if we allocate memory, it will be aligned on a word boundary. This means that all memory addresses will be divisible by 8 on 64-bit systems and by 4 on 32-bit systems, or in other words, that all memory addresses will end in 3 (64-bit) or 2 (32-bit) zero bits. Or, in other words: 87.5% (75%) of the address space is unused.
On the other hand, it would be quite a waste to represent Integers as a full-blown Ruby object:
They are immutable, which means we don't have to store any state.
They can't have instance variables, which means we don't have to store an instance variable table.
They can't have a singleton class, which means we don't have to store a __klass__ pointer.
They can't be extended.
And so on …
What this means, is that we can optimize the representation of Integers by not storing them as objects at all. All we need is some special case in the engine, so that if someone asks for the class of, say, 42, instead of trying to look at 42's __klass__ pointer, the engine "magically" knows to just return the Integer class.
Once we have that in place, we can do a really cool trick, which is actually as old as the very first LISP and Smalltalk VMs, and it is called a tagged pointer representation. Normally, the value of a variable is a pointer to the object (header), but with a tagged pointer representation we can store the value of the object inside the pointer to the object itself!
All we need to do is to have some sort of tag on the pointer that tells the engine that this is actually not a pointer but a value disguised as a pointer. In some older machines, especially those specifically designed for running high-level languages, pointers did have a tag field specifically for holding, e.g. type information or access control. Modern machines don't have that, but we have those unused bits we can (ab)use as tag bits.
And that is what YARV is doing: when the last bit of a pointer is 1, then it's not actually a pointer, it's an Integer. In particular, an Integer is encoded in YARV by shifting it one bit to the left and setting the last bit to 1. This allows us to encode a 63-bit Integer in a 64-bit pointer, and do native integer arithmetic on it with no object overhead and only a little bit-shifting overhead.
And if you think about what this encoding means:
shifting one bit to the left is equivalent to multiplying by two
setting the last bit to 1 is equivalent to incrementing by 1
Then you can explain the first pattern: a small Integer with value n is encoded as the "quasi-pointer" 2n + 1, and since "memory address" and object ID are the same in YARV (even though this is not actually a memory address, because there is no object which could have an address), it will have the object ID 2n + 1.
Integers that don't fit into 63 bits (31 bits) are allocated as objects like any other object. In different engines, the immediate kind has different names: e.g. in the Smalltalk-80 VM they are called SmallInts, and in YARV they are called Fixnums (while the ones that don't fit into a Fixnum are called Bignums). They actually used to be different subclasses of a fully-abstract Integer class in older versions of YARV, but this was considered a mistake. (It's really an internal optimization and should not be visible to the programmer.) In current versions of YARV, Fixnum and Bignum are aliases for Integer, and using them gives a deprecation warning.
This explains your result for 1. If you had tried out ObjectSpace._id2ref(3), the result would have been 1; ObjectSpace._id2ref(5) would be 2, and so on.
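You can check the pattern yourself on such a system (YARV 2.6, 64-bit; remember, none of this is guaranteed anywhere else):
1.object_id                       #=> 3   (2 * 1 + 1)
2.object_id                       #=> 5   (2 * 2 + 1)
ObjectSpace._id2ref(3)            #=> 1
ObjectSpace._id2ref(2 * 42 + 1)   #=> 42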
And we still are using only 62.5% of the address space (on a 64-bit system)!
So, let's think about what else we might want to represent in this way.
YARV has a very similar optimization for Floats. Floating-point numbers that fit into 62 bits are called flonums and are represented similarly, with a tag of 10 at the end. (YARV does not use flonums on 32-bit platforms.)
This explains your result for ObjectSpace._id2ref(2). If you had tried ObjectSpace._id2ref(6), the result would have been -2.0.
And a similar trick is also played for Symbols. I won't explain it here in detail, because a) I don't actually fully know how it works, and b) it is slightly more complex, because the value being encoded isn't directly the Symbol value, rather it is an index into the Symbol table. However, that explains your result for 128.
Now, lastly, there is a completely different part of the address space that is also unused: the low addresses. On most modern Operating Systems, the low addresses are reserved for mapping the kernel memory directly into the user process in order to speed up the user space ↔︎ kernel space transition.
Plus, there is another reason the very low addresses are kept free: in C, it is illegal to dereference a NULL pointer. Now, one way of implementing this would be for the runtime to track all pointer dereferences and check whether they are dereferencing the NULL pointer. But there is an easier way: just give the NULL pointer an actual memory address, but one that is never allocated. That way, you don't have to do anything: if the code tries to dereference the pointer, the address doesn't exist, and the MMU will take care of raising an error. So, most C compilers compile the NULL pointer to the actual memory address 0, and in order to make sure that there is never any real data allocated at that address, they keep a whole area around address 0 free.
This means that the low addresses are never used, and we can (ab)use them to represent even more "interesting" objects. Now, YARV uses the very low addresses to represent the following objects:
false at address 0, which has the additional advantage that 0 is considered false in C.
nil at address 8 (4 in 32-bit).
true at address 20 (2 in 32-bit).
Qundef (a special internal value inside the engine that denotes an undefined value) at address 52 (6 in 32-bit).
And that explains your number 8.
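Again, on YARV 2.6 on a 64-bit system (and only there), you can observe these values directly:
false.object_id          #=> 0
nil.object_id            #=> 8
true.object_id           #=> 20
ObjectSpace._id2ref(8)   #=> nil
ObjectSpace._id2ref(20)  #=> true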
This also means that your 4, 16, 32, 64, 256, 512, and 1024 will probably never resolve to an object, because they are in the low address range where the C library will simply never allocate memory.
As a closing remark, I want to repeat one last time that all of this is a private internal implementation detail of a specific version of YARV running in a specific environment. You must not rely on any of this, ever.
When flonums were introduced in YARV and nil no longer had object ID 4 on some platforms, that did break some code and cause some confusion (as evidenced e.g. by questions on Stack Overflow), even though the YARV developers are allowed to change object IDs at will, since no guarantees are made about particular ID values or the structure of IDs. Please, do not make the same mistake.

What is the meaning of :FREE in ObjectSpace.count_objects on ruby MRI

I would like to know what the count associated with the key :FREE returned from ObjectSpace.count_objects means. The documentation says this hash is implementation-specific, so my question refers specifically to MRI Ruby 2.1.
There have been at least two questions about this (here and here), but no answers regarding :FREE.
Any ideas?
In some cases, the free count is much higher than the number of objects accessible through ObjectSpace.each_object, so I don't seem to have any information about them. Are they taking up memory? In my program the :FREE count is high even after running garbage collection.
We can find the meaning of :FREE straight from the implementation itself (from gc.c):
* The keys starting with +:T_+ means live objects.
* For example, +:T_ARRAY+ is the number of arrays.
* +:FREE+ means object slots which is not used now.
* +:TOTAL+ means sum of above.
Then we can take a look at the tests for it (from test_gc.rb):
assert_equal(count[:TOTAL]-count[:FREE], stat[:heap_live_slots])
assert_equal(count[:FREE], stat[:heap_free_slots])
And finally, we can double check there's no funny business going on with:
GC.stat[:heap_free_slot] == ObjectSpace.count_objects[:FREE]
irb(main):001:0> GC.stat[:heap_free_slot] == ObjectSpace.count_objects[:FREE]
=> true
So, :FREE indicates the number of allocated slots on the heap that haven't been used.
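As a rough illustration (numbers vary from run to run), slots vacated by dead objects show up under :FREE after a GC run rather than being returned to the operating system:
before = ObjectSpace.count_objects[:FREE]
100_000.times { Object.new }   # churn through short-lived objects
GC.start
after = ObjectSpace.count_objects[:FREE]
p free_before: before, free_after: after   # :FREE is typically large after the sweep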

Speedy attribute lookup in dynamically typed language?

I'm currently developing a dynamically typed language.
One of the main problems I'm facing during development is how to do fast runtime symbol lookups.
For general, free global and local symbols I simply index them and let each scope (global or local) keep an array of the symbols and quickly look them up using the index. I'm very happy with this approach.
However, for attributes in objects the problem is much harder. I can't use the same indexing scheme on them, because I have no idea which object I'm currently accessing, thus I don't know which index to use!
Here's an example in python which reflects what I want working in my language:
from random import random

class A:
    def __init__(self):
        self.a = 10
        self.c = 30

class B:
    def __init__(self):
        self.c = 20

def test():
    if random():
        foo = A()
    else:
        foo = B()
    # There could even be an eval here that sets foo
    # to something different or removes attribute c from foo.
    print(foo.c)
Does anyone know any clever tricks to do the lookup quickly? I know about hash maps and splay trees, so I'm interested in whether there is any way to do it as efficiently as my other lookups.
Once you've reached the point where looking up properties in the hash table isn't fast enough, the standard next step is inline caching. You can do this in JIT languages, or even bytecode compilers or interpreters, though it seems to be less common there.
If the shape of your objects can change over time (i.e. you can add new properties at runtime) you'll probably end up doing something similar to V8's hidden classes.
A technique known as maps can store the values for each attribute in a compact array. The knowledge which attribute name corresponds to which index is maintained in an auxiliary data structure (the eponymous map), so you don't immediately gain a performance benefit (though it does use memory more efficiently if many objects share a set of attributes). With a JIT compiler, you can make the map persistent and constant-fold lookups, so the final machine code can use constant offsets into the attributes array (for constant attribute names).
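Here is a rough sketch of the map idea in Ruby (all class and method names are invented for illustration): objects that acquire the same attributes in the same order share one map, and each attribute name resolves to a fixed index into a flat values array.
class Shape
  attr_reader :slots

  def initialize(slots = {})
    @slots = slots            # attribute name => index into the values array
    @transitions = {}         # attribute name => Shape reached by adding it
  end

  def index_of(name)
    @slots[name]
  end

  # Cached shape transitions mean objects built the same way share one Shape.
  def add(name)
    @transitions[name] ||= Shape.new(@slots.merge(name => @slots.size))
  end
end

class Obj
  EMPTY = Shape.new

  def initialize
    @shape  = EMPTY
    @values = []
  end

  def set(name, value)
    idx = @shape.index_of(name)
    if idx.nil?
      @shape = @shape.add(name)   # shape transition: attribute seen for the first time
      idx = @shape.index_of(name)
    end
    @values[idx] = value
  end

  def get(name)
    idx = @shape.index_of(name)
    raise NameError, "undefined attribute #{name}" if idx.nil?
    @values[idx]
  end
end

a = Obj.new
a.set(:a, 10)
a.set(:c, 30)
b = Obj.new
b.set(:c, 20)
p a.get(:c)   #=> 30
p b.get(:c)   #=> 20
A JIT can then specialize a lookup like foo.c against a particular map and compile it down to a constant offset, which is essentially what V8's hidden classes do.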
In an interpreter (I'll assume byte code), things are much harder because you don't have much opportunity to specialize code for specific objects. However, I have an idea myself for turning attribute names into integral keys. Maintain a global mapping assigning integral IDs to attribute names. When adding new byte code to the VM (loading from disk or compiling in memory), scan for strings used as attributes, and replace them with the associated ID, creating a new ID if the string hasn't been seen before. Instead of storing hash tables or similar mappings on each object - or in the map, if you use maps - you can now use sparse arrays, which are hopefully more compact and faster to operate on.
I haven't had a chance to implement and test this, and you still need a sparse array. Unless you want to make all objects (or maps) take as many words of memory as there are distinct attribute names in the whole program, that is. At least you can replace string hash tables with integer hash tables.
Just by tuning a hash table for IDs as keys, you can make several optimizations: Don't invoke a hash function (use the ID as hash), remove some indirection and hence cache misses, save yourself the complexity of dealing with pathologically bad hash functions, etc.
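The interning idea above can be prototyped in a couple of lines of Ruby (names invented for illustration); each distinct attribute name gets a small integer ID the first time it is seen:
ATTR_IDS = Hash.new { |h, name| h[name] = h.size }   # attribute name => small integer ID

ATTR_IDS[:a]        #=> 0
ATTR_IDS[:c]        #=> 1
ATTR_IDS[:c]        #=> 1, already interned, same ID as before
ATTR_IDS.size       #=> 2, one entry per distinct attribute name ever seen
Per-object (or per-map) storage can then be indexed by these IDs instead of hashed by strings.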

Aren't modern computers powerful enough to handle Strings without needing to use Symbols (in Ruby)

Every text I've read about Ruby symbols talks about the efficiency of symbols over strings. But, this isn't the 1970s. My computer can handle a little bit of extra garbage collection. Am I wrong? I have the latest and greatest Pentium dual core processor and 4 gigs of RAM. I think that should be enough to handle some Strings.
Your computer may well be able to handle "a little bit of extra garbage collection", but what about when that "little bit" takes place in an inner loop that runs millions of times? What about when it's running on an embedded system with limited memory?
There are a lot of places you can get away with using strings willy-nilly, but in some you can't. It all depends on the context.
It's true that you don't need tokens all that badly for memory reasons. Your computer could undoubtedly handle all kinds of gnarly string handling.
But, in addition to being faster, tokens have the added advantage (especially with context coloring) of screaming out visually: LOOK AT ME, I AM A KEY OF A KEY-VALUE PAIR. That's a good enough reason to use them for me.
There's other reasons too... and the performance gain on lots of them might be more than you realize, especially doing something like comparison.
When comparing two ruby symbols, the interpreter is just comparing two object addresses. When comparing two strings, the interpreter has to compare every character one at a time. That kind of computation can add up if you're doing a lot of this.
Symbols have their own performance problems though... they are never garbage collected.
It's worth reading this article:
http://www.randomhacks.net/articles/2007/01/20/13-ways-of-looking-at-a-ruby-symbol
It's nice that symbols are guaranteed unique -- that can have some nice effects that you wouldn't get from String (such as the fact that two identical symbols are always the exact same object, I believe).
Plus they have a different meaning and you would want to use them in different areas, but ruby isn't too strict about that kind of stuff anyway, so I can understand your question.
Here's the real reason for the difference: strings are never the same. Every instance of a string is a separate object, even if the content is identical. And most operations on strings will make new string objects. Consider the following:
a = 'zowie'
b = 'zowie'
a == b #=> true
On the surface, it'd be easy to claim that a and b are the same. Most common sense operations will work as you'd expect. But:
a.object_id #=> 2152589920 (when I ran this in irb)
b.object_id #=> 2152572980
a.equal?(b) #=> false
They look the same, but they're different objects. Ruby had to allocate memory twice, perform the String#initialize method twice, etc. They're taking up two separate spots in memory. And hey! It gets even more fun when you try to modify them:
a += '' #=> 'zowie'
a.object_id #=> 2151845240
Here we add nothing to a and leave the content exactly the same -- but Ruby doesn't know that. It still allocates a whole new String object, reassigns the variable a to it, and the old String object sits around waiting for eventual garbage collection. Oh, and the empty '' string also gets a temporary String object allocated just for the duration of that line of code. Try it and see:
''.object_id #=> 2152710260
''.object_id #=> 2152694840
''.object_id #=> 2152681980
Are these object allocations fast on your slick multi-Gigahertz processor? Sure they are. Will they chew up much of your 4 GB of RAM? No they won't. But do it a few million times over, and it starts to add up. Most applications use temporary strings all over the place, and your code's probably full of string literals inside your methods and loops. Each of those string literals and such will allocate a new String object, every single time that line of code gets run. The real problem isn't even the memory waste; it's the time wasted when garbage collection gets triggered too frequently and your application starts hanging.
In contrast, take a look at symbols:
a = :zowie
b = :zowie
a.object_id #=> 456488
b.object_id #=> 456488
a == b #=> true
a.equal?(b) #=> true
Once the symbol :zowie gets made, it'll never make another one. Every time you refer to a given symbol, you're referring to the same object. There's no time or memory wasted on new allocations. This can also be a downside if you go too crazy with them -- they're never garbage collected, so if you start creating countless symbols dynamically from user input, you're risking an endless memory leak. But for simple literals in your code, like constant values or hash keys, they're just about perfect.
Does that help? It's not about what your application does once. It's about what it does millions of times.
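If you want to see the cumulative effect yourself, a crude benchmark along these lines (numbers depend entirely on your machine and Ruby version) usually shows symbol keys ahead of string keys:
require 'benchmark'

str_hash = { 'status' => 1 }
sym_hash = { status: 1 }

n = 5_000_000
Benchmark.bm(12) do |x|
  x.report('string keys') { n.times { str_hash['status'] } }
  x.report('symbol keys') { n.times { sym_hash[:status] } }
end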
One less character to type. That's all the justification I need to use them over strings for hash keys, etc.
