Various ways of creating Objects in Ruby

Are these methods of creating an empty Ruby Hash different? If so how?
myHash = Hash.new
myHash = {}
I'd just like a solid understanding of memory management in Ruby.

There are many ways you can create a Hash object in Ruby, though the end result is the same sort of object:
hash = { }
hash = Hash.new
hash = Hash[]
hash = some_object.to_h
hash = YAML.load("--- {}\n\n") # needs: require 'yaml'
As far as memory considerations go, an empty Hash is significantly smaller than one with even a single value in it. Arrays tend to use less memory than Hashes at small sizes, but Hashes are more efficient for lookups as the collection grows.
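You can measure this yourself with ObjectSpace.memsize_of from the standard library, which reports an object's internal size in bytes (the exact numbers vary by Ruby version and platform):

require 'objspace'

ObjectSpace.memsize_of({})              # small -- an empty Hash has no entries to store
ObjectSpace.memsize_of({ a: 1, b: 2 })  # noticeably larger once entries exist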
In practice, though, the important thing to remember in Ruby is that every time you create an object it costs you something, even if it's only an infinitesimal amount. These little hits add up if you're needlessly creating billions of objects.
Generally you should avoid creating structures that will not be used, and instead create them on demand if that wouldn't complicate things needlessly. For example, a typical pattern is:
def cache
  @cache ||= { }
end
Until this method is called, the @cache Hash is never created. The memory savings in this instance are nearly insignificant, but if that method were loading a large configuration file or importing several hundred MB of data from a database, you can imagine the savings would be significant in those cases where the data is never exercised.

The two methods are exactly equivalent.

As mentioned above, the two are operationally equivalent. If you're referring to the standard MRI/YARV implementation, perhaps this thread would help: http://www.ruby-forum.com/topic/215163#new.

With the Hash.new syntax you can specify what happens when a key is absent from the hash (the default behaviour). With the {} syntax, setting a default takes an extra step.
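For example, a small sketch:

# Hash.new takes a default value (or a block) for missing keys:
counts = Hash.new(0)
counts[:missing]   # => 0

# With the literal syntax, the default must be set separately:
counts = {}
counts.default = 0
counts[:missing]   # => 0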
my_hash = Hash[]
is another way of creating a hash; the Hash::[] method takes an even number of arguments, alternating keys and values.
my_hash = Hash[:a, 1, :b, 2] # => {:a=>1, :b=>2}
This has nothing to do with memory management.


What is Ruby's equivalent of Python's hash()?

Suppose I have an Array: ['a', 'b', 'c']. I want to record whether I have seen a particular array before.
I can put the array in a Set, but that is wasteful if I don't need to store the contents of the array, only that I have seen it before.
In Python, I could hash a tuple (i.e. hash(('a', 'b', 'c'))) and store the result in a set to achieve this. What is the way to do this in Ruby?
Ruby has #hash on most objects, including Array, but these values are not unique and will eventually collide.
For any serious use I'd strongly suggest using something like SHA-256 or stronger, as these are cryptographic hashes designed to minimize collisions.
For example:
require 'digest/sha2'
array = %w[ a b c ]
array.hash
# => 3218529217224510043 (varies between Ruby processes; #hash is randomly seeded)
Digest::SHA2.hexdigest(array.join)
# => "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad"
That value is going to be relatively unique. SHA-256 collisions are vanishingly rare due to the sheer size of the hash: 256 bits vs. the 64-bit #hash value. That's not 4x stronger, it's roughly 6.2 octodecillion times stronger. That number may as well be a "zillion" given that it has 57 zeroes in it.
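Applied to the original question, a minimal sketch (the fingerprint helper and the "\x1f" separator are my own additions here, so that ["ab", "c"] does not collide with ["a", "bc"]):

require 'set'
require 'digest/sha2'

def fingerprint(array)
  # Join with an unlikely separator before hashing.
  Digest::SHA2.hexdigest(array.join("\x1f"))
end

seen = Set.new
seen.add?(fingerprint(%w[a b c]))  # => the set (truthy) -- first sighting
seen.add?(fingerprint(%w[a b c]))  # => nil -- already seen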

Enumerator::Lazy and Garbage Collection

I am using Ruby's built in CSV parser against large files.
My approach is to separate the parsing from the rest of the logic. To achieve this I am creating an array of hashes. I also want to take advantage of Ruby's Enumerator::Lazy to prevent loading the entire file into memory.
My question is: when I'm actually iterating through the array of hashes, does the garbage collector clean things up as I go, or will it only clean up when the entire array can be collected, essentially keeping the whole thing in memory anyway?
I'm not asking whether it will clean each element as I finish with it, only whether it will clean up before the entire enum is fully evaluated.
When you iterate over a plain old array, the garbage collector has no chance to do anything.
You can help the garbage collector by writing nil into an array position once you no longer need the element, so that the object in that position becomes eligible for collection.
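A sketch of that idea (process is a stand-in for whatever you do with each element):

array.each_with_index do |element, i|
  process(element)  # hypothetical per-element work
  array[i] = nil    # drop the reference so the element can be collected
end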
When you use a lazy enumerator correctly, you are not iterating over an array of hashes. Instead you enumerate the hashes one after the other, and each one is read on demand.
So you have the chance to use much less memory (provided your further processing does not hold on to all the objects anyway).
The structure may look like this:
require 'csv'

enum = Enumerator.new do |yielder|
  CSV.foreach(path, headers: true) do |row|  # path is a placeholder; foreach reads one row at a time
    yielder.yield row.to_h
  end
end

enum.lazy.map { |hash| do_something(hash); nil }.count
You also need to make sure that you do not build the whole array again in the last step of the chain; that is why the example maps each hash to nil and finishes with count rather than to_a.
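Note that CSV.foreach already returns an Enumerator when called without a block, so (path and do_something remaining placeholders) the hand-rolled wrapper can often be dropped entirely:

require 'csv'

CSV.foreach(path, headers: true)
   .lazy
   .map { |row| do_something(row.to_h); nil }
   .count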

Speedy attribute lookup in dynamically typed language?

I'm currently developing a dynamically typed language.
One of the main problems I'm facing during development is how to do fast runtime symbol lookups.
For ordinary free global and local symbols I simply index them and let each scope (global or local) keep an array of its symbols, looking them up quickly by index. I'm very happy with this approach.
However, for attributes in objects the problem is much harder. I can't use the same indexing scheme on them, because I have no idea which object I'm currently accessing, thus I don't know which index to use!
Here's an example in python which reflects what I want working in my language:
from random import random

class A:
    def __init__(self):
        self.a = 10
        self.c = 30

class B:
    def __init__(self):
        self.c = 20

def test():
    if random():
        foo = A()
    else:
        foo = B()
    # There could even be an eval here that sets foo
    # to something different or removes attribute c from foo.
    print(foo.c)
Does anyone know any clever tricks to do the lookup quickly? I know about hash maps and splay trees, so I'm interested in whether there is any way to make it as efficient as my other lookups.
Once you've reached the point where looking up properties in the hash table isn't fast enough, the standard next step is inline caching. You can do this in JIT languages, or even bytecode compilers or interpreters, though it seems to be less common there.
If the shape of your objects can change over time (i.e. you can add new properties at runtime) you'll probably end up doing something similar to V8's hidden classes.
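To make inline caching concrete, here is a minimal monomorphic inline-cache sketch in Ruby; the shape, slots, and index_of names are hypothetical stand-ins for VM internals (one possible implementation of them is sketched after the next paragraph):

class InlineCache
  def initialize
    @cached_shape = nil
    @cached_index = nil
  end

  # One cache per call site: if the receiver's layout (shape) matches the
  # one seen last time, reuse the resolved slot index and skip the lookup.
  def lookup(obj, name)
    unless obj.shape.equal?(@cached_shape)
      @cached_shape = obj.shape                 # slow path: resolve and remember
      @cached_index = obj.shape.index_of(name)
    end
    obj.slots[@cached_index]                    # fast path: direct array access
  end
end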
A technique known as maps can store the values for each attribute in a compact array. The knowledge which attribute name corresponds to which index is maintained in an auxiliary data structure (the eponymous map), so you don't immediately gain a performance benefit (though it does use memory more efficiently if many objects share a set of attributes). With a JIT compiler, you can make the map persistent and constant-fold lookups, so the final machine code can use constant offsets into the attributes array (for constant attribute names).
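A rough sketch of the maps technique (hypothetical names again), which also supplies the shape/slots API assumed by the inline-cache sketch above:

class Shape
  def initialize(indexes = {})
    @indexes = indexes.freeze
  end

  def index_of(name)
    @indexes[name]
  end

  # Adding an attribute transitions to a new shape; a real VM would cache
  # these transitions so objects with identical layouts share one Shape.
  def with(name)
    Shape.new(@indexes.merge(name => @indexes.size))
  end
end

class Obj
  attr_reader :shape, :slots

  def initialize
    @shape = Shape.new
    @slots = []
  end

  def set(name, value)
    @shape = @shape.with(name) unless @shape.index_of(name)
    @slots[@shape.index_of(name)] = value
  end

  def get(name)
    @slots[@shape.index_of(name)]
  end
end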
In an interpreter (I'll assume byte code), things are much harder because you don't have much opportunity to specialize code for specific objects. However, I have an idea myself for turning attribute names into integral keys. Maintain a global mapping assigning integral IDs to attribute names. When adding new byte code to the VM (loading from disk or compiling in memory), scan for strings used as attributes, and replace them with the associated ID, creating a new ID if the string hasn't been seen before. Instead of storing hash tables or similar mappings on each object - or in the map, if you use maps - you can now use sparse arrays, which are hopefully more compact and faster to operate on.
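The global interning table itself is simple; a sketch:

# Hands out a stable small integer ID per attribute name on first sight.
ATTRIBUTE_IDS = Hash.new { |ids, name| ids[name] = ids.size }

ATTRIBUTE_IDS["c"]  # => 0 (first time seen)
ATTRIBUTE_IDS["a"]  # => 1
ATTRIBUTE_IDS["c"]  # => 0 (same ID every time)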
I haven't had a chance to implement and test this, and you still need a sparse array. Unless you want to make all objects (or maps) take as many words of memory as there are distinct attribute names in the whole program, that is. At least you can replace string hash tables with integer hash tables.
Just by tuning a hash table for IDs as keys, you can make several optimizations: Don't invoke a hash function (use the ID as hash), remove some indirection and hence cache misses, save yourself the complexity of dealing with pathologically bad hash functions, etc.

Performance of Sets vs. Arrays in Ruby

In Ruby, I am building a method which constructs and returns a (probably large) array which should contain no duplicate elements. Would I get better performance by using a set and then converting that to an array? Or would it be better to just call .uniq on the array I am using before I return it? Or what about using | to add items to the array instead of +=? And if I do use a set, would not having a <=> method on the objects I am putting into the set affect performance? (If you're not sure, do you know of a way to test this?)
The real answer is: write the most readable and maintainable code, and optimize it only after you've shown it is a bottleneck. If you can find an algorithm that runs in linear time, you won't have to optimize it. Here it's easy to find...
Not quite sure which methods you are suggesting, but using my fruity gem:
require 'fruity'
require 'set'
enum = 1000.times
compare do
  uniq { enum.each_with_object([]) { |x, array| array << x }.uniq }
  set  { enum.each_with_object(Set[]) { |x, set| set << x }.to_a }
  join { enum.inject([]) { |array, x| array | [x] } }
end
# set is faster than uniq by 10.0% ± 1.0%
# uniq is faster than join by 394x ± 10.0
Clearly, it makes no sense building intermediate arrays like in the third method. Otherwise, it's not going to make a big difference since you will be in O(n); that's the main thing.
BTW, Sets, uniq, and Array#| all use eql? and hash on your objects, not <=>. These need to be defined in a sane manner, because the default is that objects are never eql? unless they are the very same object (same object_id); see this question.
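For example, a value object (a hypothetical Point class) that behaves sanely in Sets and with uniq might look like:

require 'set'

class Point
  attr_reader :x, :y

  def initialize(x, y)
    @x, @y = x, y
  end

  def eql?(other)
    other.is_a?(Point) && x == other.x && y == other.y
  end
  alias_method :==, :eql?

  def hash
    [x, y].hash  # objects that are eql? must return the same hash
  end
end

Set[Point.new(1, 2), Point.new(1, 2)].size  # => 1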
Have you tried using the Benchmark library? Tests are usually very easy to construct and will properly reflect how it works in your particular version of Ruby.
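For instance, a quick Benchmark sketch comparing the two approaches (the sizes are arbitrary):

require 'benchmark'
require 'set'

data = Array.new(100_000) { rand(10_000) }

Benchmark.bmbm do |x|
  x.report("uniq") { data.uniq }
  x.report("set")  { Set.new(data).to_a }
end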

Performance using the Ruby 'include?' method

I would like to know how much the include? method can affect the performance if I have something like this:
array = [<array_values>]            # see sizes below
(0..<n_iterations>).each { |value|  # see sizes below
  array.include?(value)
}
Where <array_values> contains 10, 100, or 1000 values and <n_iterations> is 10, 100, or 1000.
Use a Set (or equivalently a Hash) instead of an array, so that include? is O(1) instead of O(n).
Or if you have multiple include? checks to do, you can use array intersection (&) or difference (-), which build a temporary Hash internally to perform the operation efficiently.
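A minimal sketch of the Set approach, using the array and value from the question:

require 'set'

set = array.to_set    # build once: O(n)
set.include?(value)   # each lookup: O(1) on average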
I think ruby-prof might be a good place to start. However, that performance data won't be useful without something else to compare this to. As in, "is the performance of this method better or worse than [some other method]?"
Also, note that as n_iterations grows larger than the size of array, inverting the loops as below will probably perform better, simply because it makes far fewer include? calls (and Range#include? on an integer range is a constant-time bounds check):
array.each do |value|
  (0..<n_iterations>).include?(value)
end
