Ruby: Why does `#hash` need to be overridden whenever `#eql?` is overridden?

In this presentation the speaker has created a value class.
In implementing it, he overrides #eql? and says that in Java development, the idiom is that whenever you override #eql? you must override #hash.
class Weight
  # ...
  def hash
    pounds.hash
  end
  def eql?(other)
    self.class == other.class &&
      self.pounds == other.pounds
  end
  alias :== eql?
end
Firstly, what is the #hash method? I can see it returns an integer.
> 1.hash
=> -3708808305943022538
> 2.hash
=> 1196896681607723080
> 1.hash
=> -3708808305943022538
Using pry I can see that an integer responds to #hash but I cannot see where it inherits the method from. It's not defined on Numeric or Object. If I knew what this method did, I would probably understand why it needs to be overridden at the same time as #eql?.
So, why does #hash need to be overridden whenever eql? is overridden?

Firstly, what is the #hash method? I can see it returns an integer.
The #hash method is supposed to return a hash of the receiver. (The name of the method is a bit of a giveaway).
Using pry I can see that an integer responds to #hash but I cannot see where it inherits the method from.
There are dozens of questions of the type "Where does this method come from" on Stack Overflow, and the answer is always the same: the best way to know where a method comes from is to simply ask it:
hash_method = 1.method(:hash)
hash_method.owner #=> Kernel
So, #hash is inherited from Kernel. Note however, that there is a bit of a peculiar relationship between Object and Kernel, in that some methods that are implemented in Kernel are documented in Object or vice versa. This probably has historic reasons, and is now an unfortunate fact of life in the Ruby community.
Unfortunately, for reasons I don't understand, the documentation for Object#hash was deleted in 2017 in a commit ironically titled "Add documents". It is, however, still available in Ruby 2.4 (bold emphasis mine):
hash → integer
Generates an Integer hash value for this object. This function must have the property that a.eql?(b) implies a.hash == b.hash.
The hash value is used along with eql? by the Hash class to determine if two objects reference the same hash key. […]
So, as you can see, there is a deep and important relationship between #eql? and #hash, and in fact the correct behavior of methods that use #eql? and #hash depends on the fact that this relationship is maintained.
So, we know that the method is called #hash and thus likely computes a hash. We know it is used together with eql?, and we know that it is used in particular by the Hash class.
What does it do, exactly? Well, we all know what a hash function is: it is a function that maps a larger, potentially infinite, input space into a smaller, finite, output space. In particular, in this case, the input space is the space of all Ruby objects, and the output space is the "fast integers" (i.e. the ones that used to be called Fixnum).
And we know how a hash table works: values are placed in buckets based on the hash of their keys. If I want to find a value, I only need to compute the hash of the key (which is fast) to know which bucket the value is in (in constant time), as opposed to e.g. an array of key-value pairs, where I need to compare the key against every key in the array (linear search) to find the value.
However, there is a problem: Since the output space of a hash is smaller than the input space, there are different objects which have the same hash value and thus end up in the same bucket. Thus, when two objects have different hash values, I know for a fact that they are different, but if they have the same hash value, then they could still be different, and I need to compare them for equality to be sure – and that's where the relationship between hash and equality comes from. Also note that when many keys end up in the same bucket, I will again have to compare the search key against every key in the bucket (linear search) to find the value.
From all this we can conclude the following properties of the #hash method:
It must return an Integer.
Not only that, it must return a "fast integer" (equivalent to the old Fixnums).
It must return the same integer for two objects that are considered equal.
It may return the same integer for two objects that are considered unequal.
However, it only should do so with low probability. (Otherwise, a Hash may degenerate into a linked list with highly degraded performance.)
It also should be hard to construct objects that are unequal but have the same hash value deliberately. (Otherwise, an attacker can force a Hash to degenerate into a linked list as a form of Degradation-of-Service attack.)
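A quick check of these properties with built-in objects might look like the following sketch. The variable names are purely illustrative, and the concrete integers are deliberately not shown, because in MRI they generally differ from one Ruby process to the next.

```ruby
# Two distinct String objects with the same content.
a = "ruby"
b = "ruby".dup

a.equal?(b)            # => false (different objects)
a.eql?(b)              # => true  (considered equal)
a.hash == b.hash       # => true  (equal objects must share a hash value)
a.hash.is_a?(Integer)  # => true

# Arrays derive their hash from their elements, so equal arrays agree too.
[1, "x"].hash == [1, "x"].hash  # => true
```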

The #hash method returns a numeric hash value for the receiving object:
:symbol.hash # => 2507
Ruby Hashes are an implementation of the hash map data structure, and they use the value returned by #hash to determine if the same key is being referenced.
Hashes leverage the #eql? method in conjunction with #hash values to determine equality.
Given that these two methods work together to provide Hashes with information about equality, if you override #eql?, you need to also override #hash to keep your object's behavior consistent with other Ruby objects.
If you do NOT override it, this happens:
class Weight
  attr_accessor :pounds
  def eql?(other)
    self.class == other.class && self.pounds == other.pounds
  end
  alias :== eql?
end
w1 = Weight.new
w2 = Weight.new
w1.pounds = 10
w2.pounds = 10
w1 == w2 # => true, these two objects should now be considered equal
weights_map = Hash.new
weights_map[w1] = '10 pounds'
weights_map[w2] = '10 pounds'
weights_map # => {#<Weight:0x007f942d0462f8 @pounds=10>=>"10 pounds", #<Weight:0x007f942d03c3c0 @pounds=10>=>"10 pounds"}
If w1 and w2 are considered equal, there should only be one key value pair in the hash. However, the Hash class is calling #hash which we did NOT override.
To fix this and make w1 and w2 truly equal as hash keys, we also override #hash:
class Weight
  def hash
    pounds.hash
  end
end
weights_map = Hash.new
weights_map[w1] = '10 pounds'
weights_map[w2] = '10 pounds'
weights_map # => {#<Weight:0x007f942d0462f8 @pounds=10>=>"10 pounds"}
Now the Hash knows these objects are equal and therefore stores only one key-value pair.

Related

How exactly does "#eql?" rely on "#hash"?

The Ruby docs read as follows:
The eql? method returns true if obj and other refer to the same hash key.
So in order to use #eql? to compare two objects (or use objects as Hash keys), the object has to implement #hash in a meaningful manner.
How come the following happens?
class EqlTest
  def hash
    123
  end
end
a = EqlTest.new
b = EqlTest.new
a.hash == b.hash # => true
a.eql? b # => false
I could of course implement EqlTest#eql? but shouldn't the implementation inherited from Object be something along the lines of hash == other.hash already?
Thanks for your hints!
It is actually the other way around. Objects for which eql? returns true are expected to return the same hash value, but eql? is not defined in terms of comparing those values. You are simply expected to override both.
The eql? method returns true if obj and other refer to the same hash key. This is used by Hash to test members for equality. For any pair of objects where eql? returns true, the hash value of both objects must be equal. So any subclass that overrides eql? should also override hash appropriately.
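To see what this means for Hash keys, here is a small sketch with two hypothetical classes (the names are made up for illustration). Both return a constant hash value, but only the second one also overrides eql?.

```ruby
# Hypothetical class: overrides #hash only, keeps the identity-based #eql?.
class ConstantHashOnly
  def hash
    123
  end
end

# Hypothetical class: overrides both #hash and #eql?.
class ConstantHashWithEql
  def hash
    123
  end

  def eql?(other)
    other.is_a?(ConstantHashWithEql)
  end
end

separate = {}
separate[ConstantHashOnly.new] = :first
separate[ConstantHashOnly.new] = :second
separate.size  # => 2 (same hash value, but eql? says the keys differ)

merged = {}
merged[ConstantHashWithEql.new] = :first
merged[ConstantHashWithEql.new] = :second
merged.size    # => 1 (same hash value and eql? agrees, so it is one key)
```

A matching hash value only makes two keys candidates for being the same key; eql? makes the final decision.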

Ruby internals and how to guarantee unique hash values

Hash in Ruby just uses its hash value (for strings and numbers). Internally, it uses the Murmur hash function. I wonder how it can be done given that the probability of having the same hash value for two different keys is not zero.
Can you share with us how you came to the conclusion that Ruby uses only the hash value to determine equality?
The text below is to explain to others your excellent point that the probability of computing the same hash value for two different keys is not zero, so how can the Hash class rely on just the hash value to determine equality?
For the purpose of this discussion I will refer to Ruby hashes as maps, so as not to confuse the 2 uses of the term hash in the Ruby language (1, a computed value on an object, and 2, a map/dictionary of pairs of values and unique keys).
As I understand it, hash values in maps, sets, etc. are used as a quick first pass at determining possible equality. That is, if the hashes of 2 objects are equal, then it is possible that the 2 objects are equal; but it's also possible that the 2 objects are not equal, but coincidentally produce the same hash value.
In other words, the only sure thing you can tell about equality from the hash values of the objects being compared is that if hash1 != hash2 then the objects are definitely not equal.
If the 2 hashes are equal, then the 2 objects must be compared directly (Ruby's Hash does this by calling the eql? method).
So comparing hashes is not a substitute for comparing the objects themselves, it's just a quick first pass used to optimize performance.
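One way to observe that the final comparison is done with eql? rather than == is to use an Integer and a Float of the same numeric value as keys. This is only a small illustration; the hash name and symbol are arbitrary.

```ruby
h = { 1 => :integer_key }

1 == 1.0     # => true  (numerically equal)
1.eql?(1.0)  # => false (eql? also requires the same type)

h[1]         # => :integer_key
h[1.0]       # => nil (1.0 is a different key as far as Hash is concerned)
```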
Remember that a "hash table" or dictionary is perfectly okay with collisions. In fact, it's expected and accommodated in any reasonable implementation.
Ideally you strive to have a hash with as few collisions as possible, and there are entire doctoral level discussions on what makes a good hashing function, but they are inevitable. When a collision does occur then two values share the same index in the container.
Regardless of how a value is hashed, any potential match based on hash must be evaluated. A direct comparison is performed to ensure that the value you're accessing is the one requested, not one that coincidentally maps to the same spot.
Normal hash tables can be thought of as an array of arrays even though this is all completely hidden from you in general purpose use.
You can implement your own hash table in Ruby if you want to explore how this behaves:
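class ExampleHash
  include Enumerable
  def initialize
    @size = 9
    @slots = Array.new(@size) { [] }
  end
  # Look up a key: pick the bucket by hash value, then compare keys within it.
  def [](key)
    @slots[key.hash % @size].each do |entry|
      return entry[1] if entry[0].eql?(key)
    end
    nil
  end
  # Insert or update: overwrite a matching entry, otherwise append to the bucket.
  def []=(key, value)
    entries = @slots[key.hash % @size]
    entries.each do |entry|
      if entry[0].eql?(key)
        entry[1] = value
        return
      end
    end
    entries << [key, value]
  end
  # Enumerable requires #each; yield every [key, value] pair.
  def each(&block)
    @slots.each { |entries| entries.each(&block) }
  end
end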
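This is made easy since every object in Ruby has a built-in hash method that produces a large numerical value. For built-in value types such as strings, numbers, and arrays it is based on the object's content; for other objects the default is based on the object's identity.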

undefined method `assoc' for #<Hash:0x10f591518> (NoMethodError)

I'm trying to return a list of values based on user defined arguments, from hashes defined in the local environment.
def my_method *args
  #initialize accumulator
  accumulator = Hash.new(0)
  #define hashes in local environment
  foo=Hash["key1"=>["var1","var2"],"key2"=>["var3","var4","var5"]]
  bar=Hash["key3"=>["var6"],"key4"=>["var7","var8","var9"],"key5"=>["var10","var11","var12"]]
  baz=Hash["key6"=>["var13","var14","var15","var16"]]
  #iterate over args and build accumulator
  args.each do |x|
    if foo.has_key?(x)
      accumulator=foo.assoc(x)
    elsif bar.has_key?(x)
      accumulator=bar.assoc(x)
    elsif baz.has_key?(x)
      accumulator=baz.assoc(x)
    else
      puts "invalid input"
    end
  end
  #convert accumulator to list, and return value
  return accumulator = accumulator.to_a {|k,v| [k].product(v).flatten}
end
The user is to call the method with arguments that are keywords, and the function to return a list of values associated with each keyword received.
For instance
> my_method("key5","key6","key1")
=> ["var10","var11","var12","var13","var14","var15","var16","var1","var2"]
The output can be in any order. I received the following error when I tried to run the code:
undefined method `assoc' for #<Hash:0x10f591518> (NoMethodError)
Could you please point me to how to troubleshoot this? In Terminal, assoc works as I expect:
> foo.assoc("key1")
=> ["key1", ["var1", "var2"]]
I'm guessing you're coming to Ruby from some other language, as there is a lot of unnecessary cruft in this method. Furthermore, it won't return what you expect for a variety of reasons.
`accumulator = Hash.new(0)`
This is unnecessary, as (1), you're expecting an array to be returned, and (2), you don't need to pre-initialize variables in ruby.
The Hash[...] syntax is unconventional in this context, and is typically used to convert some other enumerable (usually an array) into a hash, as in Hash[1,2,3,4] #=> { 1 => 2, 3 => 4}. When you're defining a hash, you can just use the curly brackets { ... }.
For every iteration of args, you're assigning accumulator to the result of the hash lookup instead of accumulating values (which, based on your example output, is what you need to do). Instead, you should be looking at various array concatenation methods like push, +=, <<, etc.
As it looks like you don't need the keys in the result, assoc is probably overkill. You would be better served with fetch or simple bracket lookup (hash[key]).
Finally, while you can call any method in Ruby with a block, as you've done with to_a, unless the method specifically yields a value to the block, Ruby will ignore it, so [k].product(v).flatten isn't actually doing anything.
I don't mean to be too critical - Ruby's syntax is extremely flexible but also relatively compact compared to other languages, which means it's easy to take it too far and end up with hard to understand and hard to maintain methods.
There is another side effect of how your method is constructed wherein the accumulator will only collect the values from the first hash that has a particular key, even if more than one hash has that key. Since I don't know if that's intentional or not, I'll preserve this functionality.
Here is a version of your method that returns what you expect:
def my_method(*args)
  foo = { "key1" => ["var1", "var2"], "key2" => ["var3", "var4", "var5"] }
  bar = { "key3" => ["var6"], "key4" => ["var7", "var8", "var9"], "key5" => ["var10", "var11", "var12"] }
  baz = { "key6" => ["var13", "var14", "var15", "var16"] }
  # Reverse before merging so that the first hash containing a key wins.
  merged = [foo, bar, baz].reverse.inject({}, :merge)
  args.inject([]) do |array, key|
    # Array(nil) is [], so unknown keys contribute nothing.
    array + Array(merged[key])
  end
end
In general, I wouldn't define a method with built-in data, but I'm going to leave it in to be closer to your original method. Hash#merge combines two hashes and overwrites any duplicate keys in the original hash with those in the argument hash (which is why the list is reversed first, so that the first hash wins). The Array() call coerces its argument to an array, so a missing key (nil) becomes an empty array and you don't need to handle that case explicitly.
I would encourage you to look up the inject method - it's quite versatile and is useful in many situations. inject uses its own accumulator variable (optionally defined as an argument) which is yielded to the block as the first block parameter.
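As a minimal sketch of how the accumulator flows through inject, here is the same idea on a small made-up lookup table (the names lookup and keys are illustrative). flat_map is shown next to it as one alternative that gives the same result without an explicit accumulator.

```ruby
lookup = { "key1" => ["var1", "var2"], "key5" => ["var10"] }
keys   = ["key5", "missing", "key1"]

# The block's return value becomes the accumulator for the next iteration.
with_inject = keys.inject([]) { |acc, key| acc + Array(lookup[key]) }

# flat_map maps each key to an array and concatenates the results.
with_flat_map = keys.flat_map { |key| Array(lookup[key]) }

with_inject    # => ["var10", "var1", "var2"]
with_flat_map  # => ["var10", "var1", "var2"]
```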

Why is a string key for a hash frozen?

According to the specification, strings that are used as a key to a hash are duplicated and frozen. Other mutable objects do not seem to have such special consideration. For example, with an array key, the following is possible.
a = [0]
h = {a => :a}
h.keys.first[0] = 1
h # => {[1] => :a}
h[[1]] # => nil
h.rehash
h[[1]] # => :a
On the other hand, a similar thing cannot be done with a string key.
s = "a"
h = {s => :s}
h.keys.first.upcase! # => RuntimeError: can't modify frozen String
Why is string designed to be different from other mutable objects when it comes to a hash key? Is there any use case where this specification becomes useful? What other consequences does this specification have?
I actually have a use case where the absence of this special treatment of strings would be useful. I use the yaml gem to read a manually written YAML file that describes a hash. The keys may be strings, and I would like to allow case insensitivity in the original YAML file. When I read a file, I might get a hash like this:
h = {"foo" => :foo, "Bar" => :bar, "BAZ" => :baz}
And I want to normalize the keys to lower case to get this:
h = {"foo" => :foo, "bar" => :bar, "baz" => :baz}
by doing something like this:
h.keys.each(&:downcase!)
but that returns an error for the reason explained above.
In short it's just Ruby trying to be nice.
When a key is entered in a Hash, a special number is calculated, using the hash method of the key. The Hash object uses this number to retrieve the key. For instance, if you ask what the value of h['a'] is, the Hash calls the hash method of string 'a' and checks if it has a value stored for that number. The problem arises when someone (you) mutates the string object, so the string 'a' is now something else, let's say 'aa'. The Hash would not find a hash number for 'aa'.
The most common types of keys for hashes are strings, symbols and integers. Symbols and integers are immutable, but strings are not. Ruby tries to protect you from the confusing behaviour described above by dupping and freezing string keys. I guess it's not done for other types because there could be nasty performance side effects (think of large arrays).
Immutable keys make sense in general because their hash codes will be stable.
This is why strings are specially-converted, in this part of MRI code:
if (RHASH(hash)->ntbl->type == &identhash || rb_obj_class(key) != rb_cString) {
    st_insert(RHASH(hash)->ntbl, key, val);
}
else {
    st_insert2(RHASH(hash)->ntbl, key, val, copy_str_key);
}
In a nutshell, in the string-key case, st_insert2 is passed a pointer to a function that will trigger the dup and freeze.
So if we theoretically wanted to support immutable lists and immutable hashes as hash keys, then we could modify that code to something like this:
VALUE key_klass;
key_klass = rb_obj_class(key);
if (key_klass == rb_cArray || key_klass == rb_cHash) {
    st_insert2(RHASH(hash)->ntbl, key, val, freeze_obj);
}
else if (key_klass == rb_cString) {
    st_insert2(RHASH(hash)->ntbl, key, val, copy_str_key);
}
else {
    st_insert(RHASH(hash)->ntbl, key, val);
}
Where freeze_obj would be defined as:
static st_data_t
freeze_obj(st_data_t obj)
{
    return (st_data_t)rb_obj_freeze((VALUE) obj);
}
So that would solve the specific inconsistency that you observed, where the array-key was mutable. However to be really consistent, more types of objects would need to be made immutable as well.
Not all types, however. For example, there'd be no point to freezing immediate objects like Fixnum because there is effectively only one instance of Fixnum corresponding to each integer value. This is why only String needs to be special-cased this way, not Fixnum and Symbol.
Strings are a special exception simply as a matter of convenience for Ruby programmers, because strings are very often used as hash keys.
Conversely, the reason that other object types are not frozen like this, which admittedly leads to inconsistent behavior, is mostly a matter of convenience for Matz & Company to not support edge cases. In practice, comparatively few people will use a container object like an array or a hash as a hash key. So if you do so, it's up to you to freeze before insertion.
Note that this is not strictly about performance, because the act of freezing a non-immediate object simply involves flipping the FL_FREEZE bit on the basic.flags bitfield that's present on every object. That's of course a cheap operation.
Also speaking of performance, note that if you are going to use string keys, and you are in a performance-critical section of code, you might want to freeze your strings before doing the insertion. If you don't, then a dup is triggered, which is a more-expensive operation.
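The dup-and-freeze behavior, and the way a pre-frozen string avoids the copy, can be observed from Ruby itself. The following is a small sketch with illustrative variable names; String.new is used so the example does not depend on whether string literals are frozen in your Ruby version.

```ruby
h = {}

# An unfrozen string key is copied, and the copy is frozen.
mutable = String.new("a")
h[mutable] = :copied
stored = h.keys.first
stored.frozen?          # => true
mutable.frozen?         # => false (the original is left alone)
stored.equal?(mutable)  # => false (the key is a different object)

# An already frozen string is stored as-is, so no copy is made.
frozen = String.new("b").freeze
h[frozen] = :reused
h.keys.last.equal?(frozen)  # => true
```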
Update: @sawa pointed out that leaving your array-key simply frozen means the original array might be unexpectedly immutable outside of the key-use context, which could also be an unpleasant surprise (although otoh it would serve you right for using an array as a hash-key, really). If you therefore surmise that dup + freeze is the way out of that, then you would in fact incur possible noticeable performance cost. On the third hand, leave it unfrozen altogether, and you get the OP's original weirdness. Weirdness all around. Another reason for Matz et al to defer these edge cases to the programmer.
See this thread on the ruby-core mailing list for an explanation (freakily, it happened to be the first mail I stumbled across when I opened up the mailing list in my mail app!).
I've no idea about the first part of your question, but here is a practical answer for the 2nd part:
new_hash = {}
h.each_pair do |k,v|
  new_hash.merge!({k.downcase => v})
end
h.replace new_hash
There's lots of permutations of this kind of code,
Hash[ h.map{|k,v| [k.downcase, v] } ]
being another (and you're probably aware of these, but sometimes it's best to take the practical route:)
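On Ruby 2.5 or newer, Hash#transform_keys is another way to write the same normalization. This is just one more permutation, not something the answers above rely on.

```ruby
h = { "foo" => :foo, "Bar" => :bar, "BAZ" => :baz }

# Builds a new hash; the frozen keys of h are never mutated.
normalized = h.transform_keys(&:downcase)
normalized  # => {"foo"=>:foo, "bar"=>:bar, "baz"=>:baz}
```

If two keys differ only in case, the later one overwrites the earlier one in the result.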
You are asking 2 different questions: theoretical and practical. Lain was the first to answer, but I would like to provide what I consider a proper, lazier solution to your practical question:
Hash.new { |hsh, key|                 # this block gets called only if a key is absent
  downcased = key.to_s.downcase
  unless downcased == key             # if downcasing makes a difference
    hsh[key] = hsh[downcased] if hsh.has_key? downcased  # define a new hash pair
  end                                 # (otherwise just return nil)
}
The block used with Hash.new constructor is only invoked for those missing keys, that are actually requested. The above solution also accepts symbols.
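A short usage sketch of that default block follows. The helper name case_insensitive_hash is hypothetical and only wraps the block so it can be reused. Note that this approach assumes the stored keys are already lower case.

```ruby
# Hypothetical helper wrapping the default block shown above.
def case_insensitive_hash
  Hash.new do |hsh, key|
    downcased = key.to_s.downcase
    unless downcased == key
      hsh[key] = hsh[downcased] if hsh.has_key?(downcased)
    end
  end
end

h = case_insensitive_hash
h["foo"] = :foo

h["FOO"]      # => :foo (found via the downcased key, then cached under "FOO")
h[:Foo]       # => :foo (symbols work because of to_s)
h["missing"]  # => nil
h.keys        # => ["foo", "FOO", :Foo]
```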
A very old question - but if anyone else is trying to answer the "how can I get around the hash keys are freezing strings" part of the question...
A simple trick you could do to solve the String special case is:
class MutableString < String
end
s = MutableString.new("a")
h = {s => :s}
h.keys.first.upcase! # => "A" (no error: the key is not a plain String, so it was not duped and frozen)
puts h.inspect
This only works if you create the keys yourself, and you must take care that it does not cause problems with code that strictly requires the class to be exactly String.

How to make object instance a hash key in Ruby?

I have a class Foo with a few member variables. When all values in two instances of the class are equal I want the objects to be 'equal'. I'd then like these objects to be keys in my hash. When I currently try this, the hash treats each instance as unequal.
h = {}
f1 = Foo.new(a,b)
f2 = Foo.new(a,b)
f1 and f2 should be equal at this point.
h[f1] = 7
h[f2] = 8
puts h[f1]
should print 8
See http://ruby-doc.org/core/classes/Hash.html
Hash uses key.eql? to test keys for
equality. If you need to use instances
of your own classes as keys in a Hash,
it is recommended that you define both
the eql? and hash methods. The hash
method must have the property that
a.eql?(b) implies a.hash == b.hash.
The eql? method is easy to implement: return true if all member variables are the same. For the hash method, use [@data1, @data2].hash as Marc-Andre suggests in the comments.
Add a method called 'hash' to your class:
class Foo
  def hash
    return whatever_munge_of_instance_variables_you_like
  end
end
Combined with a matching eql? method, this will work the way you requested: equal but distinct objects will no longer produce different hash keys.
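Putting both pieces together, a minimal sketch of such a class could look like this. The attribute names a and b come from the question's Foo.new(a, b); everything else, including mixing the class into the hash value, is an illustrative choice rather than the only way to do it.

```ruby
class Foo
  attr_reader :a, :b

  def initialize(a, b)
    @a = a
    @b = b
  end

  # Equal when the other object is a Foo with eql? member values.
  def eql?(other)
    other.class == self.class && a.eql?(other.a) && b.eql?(other.b)
  end
  alias_method :==, :eql?

  # Array#hash combines the element hashes; equal Foos get equal values.
  def hash
    [self.class, a, b].hash
  end
end

f1 = Foo.new(1, "x")
f2 = Foo.new(1, "x")

h = {}
h[f1] = 7
h[f2] = 8
h[f1]   # => 8
h.size  # => 1
```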

Resources