Aren't modern computers powerful enough to handle Strings without needing to use Symbols (in Ruby)?

Every text I've read about Ruby symbols talks about the efficiency of symbols over strings. But, this isn't the 1970s. My computer can handle a little bit of extra garbage collection. Am I wrong? I have the latest and greatest Pentium dual core processor and 4 gigs of RAM. I think that should be enough to handle some Strings.

Your computer may well be able to handle "a little bit of extra garbage collection", but what about when that "little bit" takes place in an inner loop that runs millions of times? What about when it's running on an embedded system with limited memory?
There are a lot of places you can get away with using strings willy-nilly, but in some you can't. It all depends on the context.

It's true, you don't need tokens so very badly for memory reasons. Your computer could undoubtedly handle all kinds of gnarly string handling.
But, in addition to being faster, tokens have the added advantage (especially with context coloring) of screaming out visually: LOOK AT ME, I AM A KEY OF A KEY-VALUE PAIR. That's a good enough reason to use them for me.
There's other reasons too... and the performance gain on lots of them might be more than you realize, especially doing something like comparison.
When comparing two ruby symbols, the interpreter is just comparing two object addresses. When comparing two strings, the interpreter has to compare every character one at a time. That kind of computation can add up if you're doing a lot of this.
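That difference is easy to see with a rough benchmark (a sketch, not a rigorous measurement; timings will vary by machine):

```ruby
require 'benchmark'

str_a = 'a' * 100
str_b = 'a' * 100    # equal content, but a different object
sym_a = :some_symbol
sym_b = :some_symbol # guaranteed to be the very same object

n = 100_000
Benchmark.bm(8) do |x|
  # String#== has to walk the bytes of both strings
  x.report('string:') { n.times { str_a == str_b } }
  # Symbol#== is effectively a single identity comparison
  x.report('symbol:') { n.times { sym_a == sym_b } }
end
```

The longer the strings being compared, the wider the gap grows; symbol comparison cost stays constant regardless of the name's length.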
Symbols have their own performance problems though... they are never garbage collected.
It's worth reading this article:
http://www.randomhacks.net/articles/2007/01/20/13-ways-of-looking-at-a-ruby-symbol

It's nice that symbols are guaranteed unique; that has some useful consequences you wouldn't get from String (for instance, two references to the same symbol are always the exact same object).
Plus they have a different meaning and you would want to use them in different areas, but ruby isn't too strict about that kind of stuff anyway, so I can understand your question.

Here's the real reason for the difference: strings are never the same. Every instance of a string is a separate object, even if the content is identical. And most operations on strings will make new string objects. Consider the following:
a = 'zowie'
b = 'zowie'
a == b #=> true
On the surface, it'd be easy to claim that a and b are the same. Most common sense operations will work as you'd expect. But:
a.object_id #=> 2152589920 (when I ran this in irb)
b.object_id #=> 2152572980
a.equal?(b) #=> false
They look the same, but they're different objects. Ruby had to allocate memory twice, perform the String#initialize method twice, etc. They're taking up two separate spots in memory. And hey! It gets even more fun when you try to modify them:
a += '' #=> 'zowie'
a.object_id #=> 2151845240
Here we add nothing to a and leave the content exactly the same -- but Ruby doesn't know that. It still allocates a whole new String object, reassigns the variable a to it, and the old String object sits around waiting for eventual garbage collection. Oh, and the empty '' string also gets a temporary String object allocated just for the duration of that line of code. Try it and see:
''.object_id #=> 2152710260
''.object_id #=> 2152694840
''.object_id #=> 2152681980
Are these object allocations fast on your slick multi-Gigahertz processor? Sure they are. Will they chew up much of your 4 GB of RAM? No they won't. But do it a few million times over, and it starts to add up. Most applications use temporary strings all over the place, and your code's probably full of string literals inside your methods and loops. Each of those string literals and such will allocate a new String object, every single time that line of code gets run. The real problem isn't even the memory waste; it's the time wasted when garbage collection gets triggered too frequently and your application starts hanging.
In contrast, take a look at symbols:
a = :zowie
b = :zowie
a.object_id #=> 456488
b.object_id #=> 456488
a == b #=> true
a.equal?(b) #=> true
Once the symbol :zowie gets made, it'll never make another one. Every time you refer to a given symbol, you're referring to the same object. There's no time or memory wasted on new allocations. This can also be a downside if you go too crazy with them -- they're never garbage collected, so if you start creating countless symbols dynamically from user input, you're risking an endless memory leak. But for simple literals in your code, like constant values or hash keys, they're just about perfect.
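A sketch of the dynamic-symbol hazard described above (the method and variable names are illustrative; note that Ruby 2.2+ can actually collect dynamically created symbols, though the pattern is still best avoided):

```ruby
# Fine: a fixed set of symbol literals written in the source code.
STATUSES = [:pending, :active, :done]

# Risky: minting symbols from arbitrary user input grows the
# symbol table with every previously unseen value.
def risky(user_input)
  user_input.to_sym
end

before = Symbol.all_symbols.size
risky("user_supplied_#{rand(1_000_000)}")
after = Symbol.all_symbols.size
# `after` exceeds `before` whenever the input minted a brand-new symbol
```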
Does that help? It's not about what your application does once. It's about what it does millions of times.

One less character to type. That's all the justification I need to use them over strings for hash keys, etc.

Related

Is there a way to get the Ruby runtime to combine frozen identical objects into a single instance?

I have data in memory, especially strings, that have large numbers of duplicates. We're hitting the ceiling with memory sometimes and are trying to reduce our footprint. I thought that if I froze the strings, then the Ruby runtime would combine them into single objects in memory. So I thought that this code would return a lower number, ideally, 1, but it did not:
a = Array.new(1000) { 'foo'.dup.freeze } # create separate objects, but freeze them
sleep 5 # give the runtime some time to combine the objects
a.map(&:object_id).uniq.size # => 1000
I guess this makes sense, because if there was a reference to the duplicated object (e.g. object id #202), and all of the frozen strings are combined to use #200, then dereferencing #202 will fail. So maybe my question doesn't make sense.
I guess the best strategy for me to save memory might be to convert the strings to symbols. I am aware that they will never be garbage collected, there would be a small enough number of them that this would not be a problem. Is there a better way?
You basically have the right idea, but in my opinion you've found a big gotcha in Ruby. You are correct that Ruby can dedup frozen strings to save memory, but in general frozen ≠ deduped!!!
tl;dr: the reason is that the two operations have different semantics. Always use String#-@ (unary minus) if you want a string deduped.
Recall that freeze is a method of Object, so it has to work with every class. In English, freeze is "make it so no further changes can be made to this object and also return the same object so that I can keep calling methods on it". In particular, it would be odd if x.freeze != x. Imagine if I had two arrays that I was modifying, then decided to freeze them. Would it make sense for the interpreter to then iterate through both arrays to see if their contents are equal and to decide to completely throw away one of them? That could be very expensive. So in general freeze does not promise this behavior and always returns the same object, just frozen.
Deduping works very differently because when you call -myStr you're actually saying "return the unique frozen version of this string in memory". In most cases the whole point is to get a different object than the one in myStr (so that the GC can clean up that string and only keep the frozen one).
Unfortunately, the distinction is muddled since if you call freeze on a string literal, Ruby will dedup it automatically! This is sensible because there's no way to get a reference to the original literal object; the fact that the interpreter is allowing x.freeze != x doesn't matter, so we might as well save some memory. But it might also give the impression that freeze does guarantee deduping, when in fact it does not.
This gotcha was discussed when string deduping was first introduced, so it is definitely an intentional design decision by the Ruby developers.
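The distinction can be seen directly (a sketch assuming Ruby 2.5+, where String#-@ performs deduplication):

```ruby
a = 'zowie'.dup          # plain mutable string
b = 'zowie'.dup

fa = a.freeze            # freeze returns the SAME object, now frozen
fb = b.freeze
fa.equal?(a)             #=> true  (no dedup happened)
fa.equal?(fb)            #=> false (two separate frozen copies remain)

da = -('zowie'.dup)      # String#-@ returns the deduplicated frozen copy
db = -('zowie'.dup)
da.equal?(db)            #=> true  (one shared object in the dedup table)
```

So for the original question, mapping the array through `-str` rather than `freeze` is what collapses the duplicates down to a single instance.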

Ruby, immutable integers and unused objects

a = 1
a += 1
=> 2
The original object 1 is now unused, and this is not very performant. Why are integers immutable in Ruby? I've looked on Stack Overflow but could find no explanation.
Integers, and all Numerics, are immutable so there is only ever one of each. We can see this by checking their #object_id.
2.6.4 :001 > a = 1
=> 1
2.6.4 :002 > a.object_id
=> 3
2.6.4 :003 > 1.object_id
=> 3
2.6.4 :004 > b = 1
=> 1
2.6.4 :005 > b.object_id
=> 3
This behavior is also documented in Numeric.
Other core numeric classes [other than Numeric] such as Integer are implemented as immediates, which means that each Integer is a single immutable object which is always passed by value. There can only ever be one instance of the integer 1, for example. Ruby ensures this by preventing instantiation. If duplication is attempted, the same instance is returned.
Having a single immutable object for each Integer saves memory: in normal use you're going to use 1 a lot, and everyone sharing the same 1 saves a lot of memory. But that means it must be immutable; otherwise adding to one object would change other objects, and that would be bad.
Not having to constantly deallocate and allocate the same Integer over and over is faster and reduces memory fragmentation. If they're really unused, Ruby will garbage collect them. Ruby's garbage collection is constantly being improved and can be tweaked with many environment variables.
It also simplifies their implementation and makes them more performant. Immutable objects can cache any calculations confident the cache will never need to be invalidated.
Why are integers immutable in ruby? I've looked on stackoverflow but could find no explanation.
Let's assume that Integers were mutable:
1 << 1   # suppose this mutated the object 1 in place, turning it into 2...
1 + 1
#=> 4    # ...every later use of the literal 1 would now really mean 2
I am guessing that the reason that you find no explanation is that most assume it is self-evident that the ability to change the value of 1 to be anything other than 1 would lead to an absolutely insane maintenance nightmare.
Think about what the ability to mutate integers would entail for the Ruby programmer: it would give them the power to change how the language itself observes the world. An integer object is ultimately a bit pattern in memory, and if you increment that object in place, is it still one? No. That's why Ruby has variables point to a new object rather than mutating the old one.
Immutable integers save on memory consumption in the long run. I think what you are questioning is Ruby's decision to keep the original object immutable in memory rather than mutate it, which looks like extra memory consumption. Well, a is not a global variable, and the old value will eventually be cleared by garbage collection. Ruby reuses objects rather than creating the same ones over and over. If you have more live objects than Ruby can fit into its allocated memory, it must allocate more, which is expensive at the OS level, and your code should be cognisant of that. Ruby does free memory when too much is allocated; Ruby objects live in Ruby's own object heap rather than the malloc heap, and the garbage collector cleans it often.
If integers were mutable then modifying a variable that is used elsewhere in an application would cause hard-to-fix regressions and would cause a lot of issues in web applications. As for the concern of retention and performance, look at the following example:
require 'newrelic_rpm'

a = 0
loop do
  a += 1
  if a % 10_000 == 0   # comparison, not assignment
    p a
    p NewRelic::Agent::Samplers::MemorySampler.new.sampler.get_sample
  end
end
Observe how many objects are kept and how much memory is consumed by the process. Now think of how many objects are going to be kept in a real-world program, and whether the memory consumption will be a real issue. If you do find yourself encountering this issue, which I don't think you will, see it as a notification to improve your code.

Enumerator::Lazy and Garbage Collection

I am using Ruby's built in CSV parser against large files.
My approach is to separate the parsing from the rest of the logic. To achieve this I am creating an array of hashes. I also want to take advantage of Ruby's Enumerator::Lazy to prevent loading the entire file into memory.
My question is: when I'm actually iterating through the array of hashes, does the garbage collector actually clean things up as I go, or will it only clean up when the entire array can be cleaned up, essentially keeping the entire array in memory anyway?
I'm not asking if it will clean each element as I finish with it, only if it will clean it before the entire enum is actually evaluated.
When you iterate over a plain old array, the garbage collector has no chance to do anything.
You can help the garbage collector by writing nil into the array position after you no longer need the element, so that the object in this position may now be free for collection.
When you use a lazy enumerator correctly, you are not iterating over an array of hashes. Instead you enumerate the hashes one after the other, and each one is read on demand.
So you have the chance to use much less memory (depending on your further processing, and provided nothing else holds the objects in memory anyway).
the structure may look like this:
enum = Enumerator.new do |yielder|
  csv.read(...) do
    ...
    yielder.yield hash
  end
end

enum.lazy.map { |hash| do_something(hash); nil }.count
You also need to make sure that you are not generating the array again in the last step of the chain.
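A concrete, runnable variant of that structure using the standard CSV library (the sample data and the age filter are illustrative):

```ruby
require 'csv'
require 'tempfile'

# Build a small sample file to read back (contents are illustrative).
file = Tempfile.new(['sample', '.csv'])
file.write("name,age\nalice,30\nbob,25\n")
file.flush

# CSV.foreach without a block returns an enumerator that yields one
# row at a time; .lazy keeps the map/select chain from materializing
# an intermediate array of hashes.
count = CSV.foreach(file.path, headers: true)
           .lazy
           .map { |row| row.to_h }
           .select { |h| h['age'].to_i >= 26 }
           .count

count #=> 1
file.close!
```

Each hash becomes eligible for collection as soon as the chain is done with it, since the terminal `count` never accumulates the rows.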

Speedy attribute lookup in dynamically typed language?

I'm currently developing a dynamically typed language.
One of the main problems I'm facing during development is how to do fast runtime symbol lookups.
For general, free global and local symbols I simply index them and let each scope (global or local) keep an array of the symbols and quickly look them up using the index. I'm very happy with this approach.
However, for attributes in objects the problem is much harder. I can't use the same indexing scheme on them, because I have no idea which object I'm currently accessing, thus I don't know which index to use!
Here's an example in python which reflects what I want working in my language:
class A:
    def __init__(self):
        self.a = 10
        self.c = 30

class B:
    def __init__(self):
        self.c = 20

def test():
    if random():
        foo = A()
    else:
        foo = B()
    # There could even be an eval here that sets foo
    # to something different or removes attribute c from foo.
    print foo.c
Does anyone know any clever tricks to do the lookup quickly? I know about hash maps and splay trees, so I'm interested in whether there are ways to make this as efficient as my other lookups.
Once you've reached the point where looking up properties in the hash table isn't fast enough, the standard next step is inline caching. You can do this in JIT languages, or even bytecode compilers or interpreters, though it seems to be less common there.
If the shape of your objects can change over time (i.e. you can add new properties at runtime) you'll probably end up doing something similar to V8's hidden classes.
A technique known as maps can store the values for each attribute in a compact array. The knowledge which attribute name corresponds to which index is maintained in an auxiliary data structure (the eponymous map), so you don't immediately gain a performance benefit (though it does use memory more efficiently if many objects share a set of attributes). With a JIT compiler, you can make the map persistent and constant-fold lookups, so the final machine code can use constant offsets into the attributes array (for constant attribute names).
In an interpreter (I'll assume byte code), things are much harder because you don't have much opportunity to specialize code for specific objects. However, I have an idea myself for turning attribute names into integral keys. Maintain a global mapping assigning integral IDs to attribute names. When adding new byte code to the VM (loading from disk or compiling in memory), scan for strings used as attributes, and replace them with the associated ID, creating a new ID if the string hasn't been seen before. Instead of storing hash tables or similar mappings on each object - or in the map, if you use maps - you can now use sparse arrays, which are hopefully more compact and faster to operate on.
I haven't had a chance to implement and test this, and you still need a sparse array; unless, that is, you want every object (or map) to take as many words of memory as there are distinct attribute names in the whole program. At least you can replace string hash tables with integer hash tables.
Just by tuning a hash table for IDs as keys, you can make several optimizations: Don't invoke a hash function (use the ID as hash), remove some indirection and hence cache misses, save yourself the complexity of dealing with pathologically bad hash functions, etc.
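The interning idea from the last paragraphs can be sketched in Ruby (class and method names are illustrative, not from any real VM):

```ruby
# Global table assigning a stable integer ID to each attribute name.
class AttrInterner
  def initialize
    @ids = {}
  end

  # Return the existing ID for a name, or mint the next one.
  def intern(name)
    @ids[name] ||= @ids.size
  end
end

INTERNER = AttrInterner.new

# Objects store attribute values keyed by integer ID instead of by
# string name: the ID is its own hash, key comparison is a single
# integer compare, and no string hashing happens at lookup time.
class DynObject
  def initialize
    @slots = {}
  end

  def set(name, value)
    @slots[INTERNER.intern(name)] = value
  end

  def get(name)
    @slots[INTERNER.intern(name)]
  end
end

o = DynObject.new
o.set('a', 10)
o.get('a') #=> 10
```

In a real bytecode VM the `intern` call would happen once at load time, so the interpreter only ever sees integer keys.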

When to refactor string usage to a Symbol in Ruby

According to tryruby.org, a symbol uses a single memory allocation, and every later use points to that one allocation, whereas storing multiple strings, even identical ones, stores multiple instances in memory. So, much like how MP3 and other compression or optimization methods exploit repetition, what considerations go into refactoring from multiple strings to symbols? As soon as you have two duplicates? Only when you notice performance drops? A logarithmic calculation? Other considerations or viewpoints?
I am a programmer interested in learning strong, positive conventions; that is why I am asking.
A symbol is basically an immutable, interned string. That means, it can't be changed in place (e.g. by using gsub!) and it is guaranteed that two usages of the same symbol always return the same object:
"foo".object_id == "foo".object_id
# => false
:foo.object_id == :foo.object_id
# => true
Because of that guarantee, symbols are never garbage collected. Once you have "created" a symbol, it will be kept forever in the current process.
Generally, you should use symbols when you have a static string, or at least a limited number of them, such as keys in hashes or references to methods. Using symbols here ensures that you are always getting the same object back.
With ordinary strings, depending on how you compare them, it is possible to get different objects back: two or more strings can look the same but not actually be the same object (see the example above).
When required by an input or output of the program, then use strings.
When it is used only internal to the program, and belongs to a relatively small closed set, such as flags, then use symbols.
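A small sketch of that rule of thumb (the state names and helper methods are illustrative):

```ruby
# Internal, closed set of flags: symbols.
VALID_STATES = [:queued, :running, :done].freeze

def advance(state)
  case state
  when :queued  then :running
  when :running then :done
  else state
  end
end

# Data arriving from the outside world stays a String until validated,
# then maps onto the closed symbol set (nil if it doesn't belong).
def parse_state(raw)
  normalized = raw.to_s.downcase
  VALID_STATES.find { |s| s.to_s == normalized }
end

parse_state('RUNNING') #=> :running
advance(:queued)       #=> :running
```

Validating against the closed set before converting also avoids minting symbols from arbitrary input.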
