Ruby hashes, duck typing and JSON deserialization - ruby

I am pretty new to Ruby and am still discovering how it differs from Java. Consider the following code snippet:
require 'json'
file = File.new('test.json', 'w')
hash = {}
hash['1234'] = 'onetwothreefour_str'
hash[1234] = 'onetwothreefour_num'
puts hash.to_json
file.write(hash.to_json)
file.close
str = File.read('test.json')
puts str
puts JSON.parse(str)
it outputs
{"1234":"onetwothreefour_str","1234":"onetwothreefour_num"}
{"1234":"onetwothreefour_str","1234":"onetwothreefour_num"}
{"1234"=>"onetwothreefour_num"}
So after deserialization the hash has one entry fewer.
Now, the question: is this normal behaviour? I think it is perfectly legal to store keys of different types in a hash. If so, shouldn't to_json write the keys to the file as '1234' and 1234?
Just to be clear: I understand that it's better to have keys of the same type. I just noticed that after restoring, my object has its keys as strings instead of numbers.

Yes, Ruby hashes can have keys of any type.
The JSON spec, on the other hand, dictates that object keys must be strings; no other type is allowed.
That explains the output you observe: on serialization, the integer key is turned into a string, making it a duplicate of the other key. When the JSON is read back, duplicate keys are dropped (the last one wins, IIRC). I'm pretty sure you would get the same behaviour if you consumed that JSON from JavaScript.
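A minimal sketch of the round trip described above, using a single integer key so that nothing collides. The variable names are illustrative, and converting the keys back with Integer() is only one possible approach; it assumes you know that every key was an integer to begin with.

```ruby
require 'json'

# Only an Integer key this time, so there is no duplicate on the way out.
original = { 1234 => 'onetwothreefour_num' }

json = original.to_json      # the Integer key is written as the string "1234"
restored = JSON.parse(json)  # JSON object keys always come back as Strings

restored.key?('1234')  # => true
restored.key?(1234)    # => false

# JSON carries no record of the original key type, so any conversion
# back has to be done explicitly by the caller.
with_int_keys = restored.transform_keys { |key| Integer(key) }
with_int_keys == original  # => true
```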

Related

Is there a similar solution as Array#wrap for hashes?

I use Array.wrap(x) all the time in order to ensure that Array methods actually exist on an object before calling them.
What is the best way to similarly ensure a Hash?
Example:
def ensure_hash(x)
# TODO: this is what I'm looking for
end
values = [nil,1,[],{},'',:a,1.0]
values.all?{|x| ensure_hash(x).respond_to?(:keys) } # true
The best I've been able to come up with so far is:
Hash::try_convert(x) || {}
However, I would prefer something more elegant.
tl; dr: In an app with proper error handling, there is no "easy, care-free" way to handle something that may or may not be hashy.
From a conceptual standpoint, the answer is no. There is no similar solution as Array.wrap(x) for hashes.
An array is a collection of values. Single values can exist outside of arrays (e.g. x = 42), so it's a straightforward task to wrap a value in an array (a = [42]).
A hash is a collection of key-value pairs. In Ruby, a single key-value pair can't exist outside of a hash. The only way to express a key-value pair is with a hash: h = { v: 42 }
Of course, there are a thousand ways to express a key-value pair as a single value. You could use an array [k, v], a delimited string "k:v", or some more obscure method.
But at that point, you're no longer wrapping, you're parsing. Parsing relies on properly formatted data and has multiple points of failure. No matter how you look at it, if you find yourself in a situation where you may or may not have a hash, that means you need to write a proper chunk of code for data validation and parsing (or refactor your upstream code so that you can always expect a hash).
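To illustrate the point, here is a sketch of the asker's own Hash.try_convert fallback (ensure_hash is the hypothetical helper name from the question). It satisfies the respond_to?(:keys) check, but anything that is not already hash-like is silently discarded rather than wrapped or parsed.

```ruby
# Hypothetical helper: returns x if it is hash-like (responds to to_hash),
# otherwise falls back to an empty Hash.
def ensure_hash(x)
  Hash.try_convert(x) || {}
end

values = [nil, 1, [], {}, '', :a, 1.0]
values.all? { |x| ensure_hash(x).respond_to?(:keys) }  # => true

ensure_hash({ a: 1 })   # the hash passes through unchanged
ensure_hash([[:a, 1]])  # => {}  (the pair is dropped, not parsed)
```

Whether dropping the input is acceptable depends on the application; if it is not, explicit validation is still needed.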

Comparing YAML files in ruby without loading

I am looking for a way to compare (and, best case, also diff) two YAML files in Ruby; regardless of key order, naturally. So far all solutions I found depended on loading the files with YAML::load_file(). I cannot do that, however, because the files are dumps of Ruby objects whose class declarations I do not have, so that loading them throws undefined class/module.
I think I need to load them as string hashes and compare that, but how do I tell Ruby to ignore the type information and just include it into the comparison?
Based on comments: I'm basically interested in text-based comparison, but it must be aware of the "depth" of the data structure. For instance this is an excerpt from one of the files I have:
attributes: !ruby/hash:Example::Attributes
!binary "b2NjaQ==": !ruby/hash:Example::Attributes
!binary "Y29yZQ==": !ruby/hash:Example::Attributes
!binary "aWQ=": !ruby/object:Example::Properties
type: string
required: false
mutable: false
!binary "dGl0bGU=": !ruby/object:Example::Properties
type: string
required: false
mutable: false
So the comparison must be able to identify a match even if the two attributes are in reverse order.
Psych, Ruby’s Yaml parser, provides several ways to examine Yaml data. The highest level loads the Yaml and provides a Ruby data structure. This is the API that looks at the Yaml tags and tries to load the appropriate Ruby classes, which is causing your problems. It also looks at the format of the data and converts it to various types (e.g. Dates) if it matches.
The next level will parse the Yaml and provide you with an AST containing the “raw” Yaml data. The high level api works by first parsing to this AST and then traversing it using the visitor pattern to create Ruby data (normally a Hash or Array). Unfortunately it doesn’t provide anything in between these two levels, but it is fairly easy to create a parser that creates a simplified data structure.
At its core Yaml data basically consists of scalars (which are basically strings), mappings (hashes) and sequences (arrays) – all of which can have a tag associated with them. The AST provided by Psych consists of these three types (and a couple of others), and we can create our own visitor that traverses it and produces a Ruby structure that consists solely of hashes, arrays and strings.
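As a rough illustration of what that AST looks like, the sketch below parses a small tagged document without loading any classes. The YAML text and the Example::Thing tag are made up for the example.

```ruby
require 'psych'

# Psych.parse builds the AST only; it never tries to resolve the tag
# into a Ruby class, so the missing Example::Thing is not a problem.
doc  = Psych.parse("tagged: !ruby/object:Example::Thing\n  list: [1, 2]\n")
root = doc.root                  # a Psych::Nodes::Mapping

# A mapping's children alternate between key nodes and value nodes.
key, value = root.children
key.value                        # => "tagged"
value.tag                        # => "!ruby/object:Example::Thing"

list_key, list = value.children
list.class                       # => Psych::Nodes::Sequence
list.children.map(&:value)       # => ["1", "2"]  (scalars are raw strings)
```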
This is loosely based on the Psych ToRuby visitor class, but instead of trying to convert the data to the appropriate Ruby type it only creates arrays, hashes and strings, throwing away any data in tags:
require 'psych'
class ToPlain < Psych::Visitors::Visitor
# Scalars are just strings.
def visit_Psych_Nodes_Scalar o
o.value
end
# Sequences are arrays.
def visit_Psych_Nodes_Sequence o
o.children.each_with_object([]) do |child, list|
list << accept(child)
end
end
# Mappings are hashes.
def visit_Psych_Nodes_Mapping o
o.children.each_slice(2).each_with_object({}) do |(k,v), h|
h[accept(k)] = accept(v)
end
end
# We also need to handle documents...
def visit_Psych_Nodes_Document o
accept o.root
end
# ... and streams.
def visit_Psych_Nodes_Stream o
o.children.map { |c| accept c }
end
# Aliases aren't handled here :-(
def visit_Psych_Nodes_Alias o
# Not implemented!
end
end
(Note this doesn’t handle aliases. It’s not too difficult to add support for them, have a look at what ToRuby does, in particular the register method and how it’s used.)
You can make use of this like this:
# Could also use parse_stream or parse_file here
ast = Psych.parse(my_data)
data = ToPlain.new.accept(ast)
# data now consists of just arrays, hashes and strings
If you use this on your example data, the result is a hash that looks something like this:
{
"attributes"=>{
"b2NjaQ=="=>{
"Y29yZQ=="=>{
"aWQ="=>{
"type"=>"string",
"required"=>"false",
"mutable"=>"false"
},
"dGl0bGU="=>{
"type"=>"string",
"required"=>"false",
"mutable"=>"false"
}
}
}
}
}
Whilst the keys are a little unwieldy because you are using binary data, you can still make comparisons like this:
occi_core_id = data["attributes"]["b2NjaQ=="]["Y29yZQ=="]["aWQ="]
occi_core_title = data["attributes"]["b2NjaQ=="]["Y29yZQ=="]["dGl0bGU="]
puts occi_core_id == occi_core_title
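For the original goal of comparing two whole files regardless of key order, a compact sketch follows. PlainVisitor and plain are hypothetical names; the visitor is a trimmed variant of the one above (no stream or alias handling), and the two YAML strings are invented for the example. It relies on Ruby's Hash#== ignoring insertion order.

```ruby
require 'psych'

# Hypothetical visitor: reduces the AST to strings, arrays and hashes.
class PlainVisitor < Psych::Visitors::Visitor
  def visit_Psych_Nodes_Scalar(node)
    node.value
  end

  def visit_Psych_Nodes_Sequence(node)
    node.children.map { |child| accept(child) }
  end

  def visit_Psych_Nodes_Mapping(node)
    node.children.each_slice(2).each_with_object({}) do |(k, v), hash|
      hash[accept(k)] = accept(v)
    end
  end

  def visit_Psych_Nodes_Document(node)
    accept(node.root)
  end
end

# Hypothetical helper: YAML text in, plain structure out.
def plain(yaml_text)
  PlainVisitor.new.accept(Psych.parse(yaml_text))
end

left = <<~YAML
  attributes: !ruby/hash:Example::Attributes
    id: !ruby/object:Example::Properties
      type: string
      required: false
    title: !ruby/object:Example::Properties
      type: string
      required: false
YAML

right = <<~YAML
  attributes: !ruby/hash:Example::Attributes
    title: !ruby/object:Example::Properties
      required: false
      type: string
    id: !ruby/object:Example::Properties
      required: false
      type: string
YAML

plain(left) == plain(right)  # => true, key order does not matter
```

Note that sequences stay order-sensitive with this approach, and tags are ignored entirely, so two nodes that differ only in their tag compare as equal.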

Why is :key.hash != 'key'.hash in Ruby?

I'm learning Ruby right now for the Rhodes mobile application framework and came across this problem: Rhodes' HTTP client parses JSON responses into Ruby data structures, e.g.
puts @params # prints {"body"=>{"results"=>[]}}
Since the key "body" is a string here, my first attempt @params[:body] failed (is nil) and instead it must be @params['body']. I find this most unfortunate.
Can somebody explain the rationale why strings and symbols have different hashes, i.e. :body.hash != 'body'.hash in this case?
Symbols and strings serve two different purposes.
Strings are your good old familiar friends: mutable and garbage-collectable. Every time you use a string literal or #to_s method, a new string is created. You use strings to build HTML markup, output text to screen and whatnot.
Symbols, on the other hand, are different. Each symbol exists as a single instance, and before Ruby 2.2 it was never garbage-collected (since 2.2, dynamically created symbols can be collected). Because of that you should create new symbols carefully (String#to_sym and the :'' literal). These properties make them a good candidate for naming things. For example, it's idiomatic to use symbols in macros like attr_reader :foo.
If you got your hash from an external source (you deserialized a JSON response, for example) and you want to use symbols to access its elements, then you can either use HashWithIndifferentAccess (as others pointed out), or call helper methods from ActiveSupport:
require 'active_support/core_ext'
h = {"body"=>{"results"=>[]}}
h.symbolize_keys # => {:body=>{"results"=>[]}}
h.stringify_keys # => {"body"=>{"results"=>[]}}
Note that these methods only touch the top level and do not descend into child hashes (ActiveSupport also provides deep_symbolize_keys and deep_stringify_keys for that).
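If the hash comes straight from JSON text and ActiveSupport is not available, one alternative is the symbolize_names option of the standard json library, which converts keys at every nesting level while parsing. This is a sketch with a made-up payload, not the approach used in the answer above.

```ruby
require 'json'

raw = '{"body":{"results":[]}}'

# symbolize_names: true turns every object key into a Symbol,
# including the keys of nested objects.
params = JSON.parse(raw, symbolize_names: true)

params[:body]            # the nested hash, now keyed by :results
params[:body][:results]  # => []
params['body']           # => nil, the String key no longer exists
```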
Symbols and Strings are never ==:
:foo == 'foo' # => false
That's a (very reasonable) design decision. After all, they have different classes, methods, one is mutable the other isn't, etc...
Because of that, it is mandatory that they are never eql?:
:foo.eql? 'foo' # => false
Two objects that are not eql? typically don't have the same hash, but even if they did, the Hash lookup uses hash and then eql?. So your question really was "why are symbols and strings not eql?".
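A small sketch of that two-step lookup. LooseKey is a hypothetical class written only for this illustration: two separate instances find the same entry because they agree on both hash and eql?, whereas a String and a Symbol never do.

```ruby
string_keyed = { 'body' => { 'results' => [] } }
string_keyed[:body]   # => nil
:body.eql?('body')    # => false

# Hypothetical key class: instances built from "body" or :body are
# interchangeable as Hash keys because #hash and #eql? agree.
class LooseKey
  attr_reader :name

  def initialize(name)
    @name = name.to_s
  end

  # Step 1 of a Hash lookup: pick the bucket.
  def hash
    name.hash
  end

  # Step 2 of a Hash lookup: confirm the candidate really matches.
  def eql?(other)
    other.is_a?(LooseKey) && name == other.name
  end
end

loose = { LooseKey.new(:body) => 1 }
loose[LooseKey.new('body')]  # => 1
```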
Rails uses HashWithIndifferentAccess that accesses indifferently with strings or symbols.
In Rails, the params hash is actually a HashWithIndifferentAccess rather than a standard ruby Hash object. This allows you to use either strings like 'action' or symbols like :action to access the contents.
You will get the same results regardless of what you use, but keep in mind this only works on HashWithIndifferentAccess objects.
Copied from: Params hash keys as symbols vs strings

Why use symbols as hash keys in Ruby?

A lot of times people use symbols as keys in a Ruby hash.
What's the advantage over using a string?
E.g.:
hash[:name]
vs.
hash['name']
TL;DR:
Using symbols not only saves time when doing comparisons, but also saves memory, because they are only stored once.
Ruby symbols are immutable (they can't be changed), which makes looking something up much easier.
Short(ish) answer:
Using symbols not only saves time when doing comparisons, but also saves memory, because they are only stored once.
Symbols in Ruby are basically "immutable strings". That means they cannot be changed, and it implies that the same symbol, when referenced many times throughout your source code, is always stored as the same entity, i.e. it has the same object id.
a = 'name'
a.object_id
=> 557720
b = 'name'
b.object_id
=> 557740
'name'.object_id
=> 1373460
'name'.object_id
=> 1373480 # !! different entity from the one above
# Ruby assumes any string can change at any point in time,
# therefore treating it as a separate entity
# versus:
:name.object_id
=> 71068
:name.object_id
=> 71068
# the symbol :name is a reference to the same unique entity
Strings, on the other hand, are mutable: they can be changed at any time. This implies that Ruby needs to store each string you mention throughout your source code as its own separate entity. For example, if the string "name" is mentioned multiple times in your source code, Ruby needs to store each occurrence in a separate String object, because any of them might change later on (that's the nature of a Ruby string).
If you use a string as a Hash key, Ruby needs to evaluate the string and look at its contents (and compute a hash function on them) and compare the result against the (hashed) values of the keys which are already stored in the Hash.
If you use a symbol as a Hash key, it's implicit that it's immutable, so Ruby can basically just do a comparison of the (hash function of the) object-id against the (hashed) object-ids of keys which are already stored in the Hash. (much faster)
Downside:
Each symbol consumes a slot in the Ruby interpreter's symbol table. Before Ruby 2.2 that slot was never released.
Before Ruby 2.2, symbols were never garbage-collected; since 2.2, dynamically created symbols can be collected.
So a corner-case is when you have a large number of symbols (e.g. auto-generated ones). In that case you should evaluate how this affects the size of your Ruby interpreter (e.g. Ruby can run out of memory and blow up if you generate too many symbols programmatically).
Notes:
If you do comparisons, Ruby can compare symbols just by comparing their object ids, without having to look at their contents. That's much faster than comparing strings, whose contents need to be examined.
If you access a hash, Ruby always applies a hash-function to compute a "hash-key" from whatever key you use. You can imagine something like an MD5-hash. And then Ruby compares those "hashed keys" against each other.
Every time you use a string in your code, a new instance is created - string creation is slower than referencing a symbol.
Starting with Ruby 2.1, when you use frozen strings, Ruby will use the same string object. This avoids having to create new copies of the same string, and they are stored in a space that is garbage collected.
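The identity claims in these notes can be checked with equal?, which compares object identity rather than content. This is a sketch; the last check relies on the literal-freeze optimisation described above and assumes MRI 2.1 or later.

```ruby
# Two separately built Strings: same content, different objects.
a = String.new('name')
b = String.new('name')
a == b               # => true
a.equal?(b)          # => false

# The same Symbol literal always refers to one object.
:name.equal?(:name)  # => true

# Frozen String literals with identical content share one object
# (assumption: MRI 2.1+, where "literal".freeze is deduplicated).
x = 'name'.freeze
y = 'name'.freeze
x.equal?(y)          # => true
```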
Long answers:
https://web.archive.org/web/20180709094450/http://www.reactive.io/tips/2009/01/11/the-difference-between-ruby-symbols-and-strings
http://www.randomhacks.net.s3-website-us-east-1.amazonaws.com/2007/01/20/13-ways-of-looking-at-a-ruby-symbol/
https://www.rubyguides.com/2016/01/ruby-mutability/
The reason is efficiency, with multiple gains over a String:
Symbols are immutable, so the question "what happens if the key changes?" doesn't need to be asked.
Strings are duplicated in your code and will typically take more space in memory.
Hash lookups must compute the hash of the keys to compare them. This is O(n) in the length of the key for Strings and constant for Symbols.
Moreover, Ruby 1.9 introduced a simplified syntax just for hash with symbols keys (e.g. h.merge(foo: 42, bar: 6)), and Ruby 2.0 has keyword arguments that work only for symbol keys.
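A short sketch of that last point. The greet method and its parameters are invented for the example: a symbol-keyed hash can be splatted directly into keyword arguments, while a string-keyed hash has to have its keys converted first.

```ruby
# Hypothetical method with one required and one optional keyword argument.
def greet(name:, greeting: 'Hello')
  "#{greeting}, #{name}!"
end

symbol_opts = { name: 'Ada' }   # Ruby 1.9+ shorthand for { :name => 'Ada' }
greet(**symbol_opts)            # => "Hello, Ada!"

# String keys (e.g. from parsed JSON) do not match keyword parameters,
# so they are converted to Symbols before splatting.
string_opts = { 'name' => 'Ada', 'greeting' => 'Hi' }
greet(**string_opts.transform_keys(&:to_sym))  # => "Hi, Ada!"
```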
Notes:
1) You might be surprised to learn that Ruby treats String keys differently than any other type. Indeed:
s = "foo"
h = {}
h[s] = "bar"
s.upcase!
h.rehash # must be called whenever a key changes!
h[s] # => nil, not "bar"
h.keys # => ["foo"]
h.keys.first.upcase! # => raises FrozenError on Ruby 2.5+ (RuntimeError or TypeError on older versions)
For string keys only, Ruby will use a frozen copy instead of the object itself.
2) The letters "b", "a", and "r" are stored only once for all occurrences of :bar in a program. Before Ruby 2.2, it was a bad idea to constantly create new Symbols that were never reused, as they would remain in the global Symbol lookup table forever. Ruby 2.2 will garbage collect them, so no worries.
3) Actually, computing the hash for a Symbol didn't take any time in Ruby 1.8.x, as the object ID was used directly:
:bar.object_id == :bar.hash # => true in Ruby 1.8.7
In Ruby 1.9.x, this has changed as hashes change from one session to another (including those of Symbols):
:bar.hash # => some number that will be different next time Ruby 1.9 is run
Re: what's the advantage over using a string?
Styling: it's the Ruby way
(Very) slightly faster value look ups since hashing a symbol is equivalent to hashing an integer vs hashing a string.
Disadvantage: consumes a slot in the program's symbol table that is never released.
I'd be very interested in a follow-up regarding frozen strings introduced in Ruby 2.x.
When you deal with numerous strings coming from a text input (I'm thinking of HTTP params or payload, through Rack, for example), it's way easier to use strings everywhere.
When you deal with dozens of them but they never change (if they're your business "vocabulary"), I like to think that freezing them can make a difference. I haven't done any benchmark yet, but I guess it would be close to symbol performance.

ruby hash autovivification (facets)

Here is a clever trick to enable hash autovivification in ruby (taken from facets):
# File lib/core/facets/hash/autonew.rb, line 19
def self.autonew(*args)
leet = lambda { |hsh, key| hsh[key] = new( &leet ) }
new(*args,&leet)
end
Although it works (of course), I find it really frustrating that I can't figure out how this two-liner does what it does.
leet is passed as the default. Then just accessing h['new_key'] somehow invokes it and creates 'new_key' => {}
Now, I'd expect h['new_key'] to return the default value object rather than evaluate it. That is, I would not expect 'new_key' => {} to be created automatically. So how does leet actually get called? Especially with two parameters?
The standard new method for Hash accepts a block. This block is called in the event of trying to access a key in the Hash which does not exist. The block is passed the Hash itself and the key that was requested (the two parameters) and should return the value that should be returned for the requested key.
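A minimal sketch of that mechanism on its own, before any recursion is involved. The calls array is only there to record when the block runs.

```ruby
calls = []

# The block receives the Hash itself and the missing key.
h = Hash.new do |hash, key|
  calls << key
  "no value for #{key}"
end

h[:missing]       # => "no value for missing"
h.key?(:missing)  # => false, the block did not store anything

h[:present] = 1
h[:present]       # => 1, the block is not called for existing keys

calls             # => [:missing]
```

Nothing is stored unless the block assigns to the hash itself, which is exactly what the leet lambda does with hsh[key] = ....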
You will notice that the leet lambda does 2 things. It returns a new Hash with leet itself as the block for handling defaults. This is the behaviour which allows autonew to work for Hashes of arbitrary depth. It also assigns this new Hash to hsh[key] so that next time you request the same key you will get the existing Hash rather than a new one being created.
It's also worth noting that this code can be made into a one-liner as follows:
def self.autonew(*args)
new(*args){|hsh, key| hsh[key] = Hash.new(&hsh.default_proc) }
end
The call to Hash#default_proc returns the proc that was used to create the parent, so we have a nice recursive setup here.
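The same recursive idea can be tried out without patching Hash. This is a sketch using a plain local variable (tree is an arbitrary name).

```ruby
# Every missing key gets a new Hash that reuses the parent's default proc.
tree = Hash.new { |hash, key| hash[key] = Hash.new(&hash.default_proc) }

tree[:a][:b][:c] = 1
tree == { a: { b: { c: 1 } } }  # => true

# Side effect worth knowing: merely reading a missing key creates it.
tree.key?(:x)   # => false
tree[:x]        # returns a new empty Hash and stores it under :x
tree.key?(:x)   # => true
```

Use key? or fetch when you want to check for a key without creating it.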
I talk about a similar case to this on my blog.
Alternatively, you might consider my xkeys gem. It's a module that you can use to extend arrays or hashes to facilitate nested access.
If you look for something that doesn't exist yet, you get a nil value (or another value or an exception if you prefer) without creating anything by looking. It can also append to the end of arrays.
You can opt to autovivify either hashes or arrays for integer keys (but just once for the entire structure).
