Comparing YAML files in Ruby without loading

I am looking for a way to compare (and, best case, also diff) two YAML files in Ruby; regardless of key order, naturally. So far all solutions I found depended on loading the files with YAML::load_file(). I cannot do that, however, because the files are dumps of Ruby objects whose class declarations I do not have, so that loading them throws undefined class/module.
I think I need to load them as string hashes and compare that, but how do I tell Ruby to ignore the type information and just include it into the comparison?
Based on comments: I'm basically interested in text-based comparison, but it must be aware of the "depth" of the data structure. For instance this is an excerpt from one of the files I have:
attributes: !ruby/hash:Example::Attributes
  !binary "b2NjaQ==": !ruby/hash:Example::Attributes
    !binary "Y29yZQ==": !ruby/hash:Example::Attributes
      !binary "aWQ=": !ruby/object:Example::Properties
        type: string
        required: false
        mutable: false
      !binary "dGl0bGU=": !ruby/object:Example::Properties
        type: string
        required: false
        mutable: false
So the comparison must be able to identify a match even if the two attributes are in reverse order.

Psych, Ruby’s YAML parser, provides several ways to examine YAML data. The highest level loads the YAML and returns a Ruby data structure. This is the API that looks at the YAML tags and tries to instantiate the corresponding Ruby classes, which is what causes your problem. It also looks at the format of the data and converts it to various types (e.g. Dates) if it matches.
The next level parses the YAML and gives you an AST containing the “raw” YAML data. The high-level API works by first parsing to this AST and then traversing it using the visitor pattern to create Ruby data (normally a Hash or Array). Unfortunately Psych doesn’t provide anything in between these two levels, but it is fairly easy to write a visitor that creates a simplified data structure.
At its core, YAML data consists of scalars (essentially strings), mappings (hashes) and sequences (arrays), all of which can have a tag associated with them. The AST provided by Psych consists of these three types (and a couple of others), and we can create our own visitor that traverses it and produces a Ruby structure that consists solely of hashes, arrays and strings.
This is loosely based on the Psych ToRuby visitor class, but instead of trying to convert the data to the appropriate Ruby type it only creates arrays, hashes and strings, throwing away any data in tags:
require 'psych'

class ToPlain < Psych::Visitors::Visitor
  # Scalars are just strings.
  def visit_Psych_Nodes_Scalar(o)
    o.value
  end

  # Sequences are arrays.
  def visit_Psych_Nodes_Sequence(o)
    o.children.each_with_object([]) do |child, list|
      list << accept(child)
    end
  end

  # Mappings are hashes. Children alternate key, value, key, value...
  def visit_Psych_Nodes_Mapping(o)
    o.children.each_slice(2).each_with_object({}) do |(k, v), h|
      h[accept(k)] = accept(v)
    end
  end

  # We also need to handle documents...
  def visit_Psych_Nodes_Document(o)
    accept(o.root)
  end

  # ... and streams.
  def visit_Psych_Nodes_Stream(o)
    o.children.map { |c| accept(c) }
  end

  # Aliases aren't handled here :-(
  def visit_Psych_Nodes_Alias(o)
    # Not implemented! (returns nil)
  end
end
(Note this doesn’t handle aliases. It’s not too difficult to add support for them, have a look at what ToRuby does, in particular the register method and how it’s used.)
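As a rough sketch of what alias support could look like: the variant below remembers every anchored node in a table and resolves aliases from it. The class name ToPlainWithAliases and the private register helper are made up for this illustration (the helper is only modelled on the idea behind ToRuby's register, not copied from it), and the sample YAML is invented.

```ruby
require 'psych'

# Sketch only: same idea as ToPlain, plus a table of anchored values.
class ToPlainWithAliases < Psych::Visitors::Visitor
  def initialize
    super
    @anchors = {} # anchor name => already converted value
  end

  def visit_Psych_Nodes_Scalar(o)
    register(o, o.value)
  end

  def visit_Psych_Nodes_Sequence(o)
    # Register before visiting children so nested aliases can find it.
    list = register(o, [])
    o.children.each { |child| list << accept(child) }
    list
  end

  def visit_Psych_Nodes_Mapping(o)
    hash = register(o, {})
    o.children.each_slice(2) { |k, v| hash[accept(k)] = accept(v) }
    hash
  end

  def visit_Psych_Nodes_Document(o)
    accept(o.root)
  end

  def visit_Psych_Nodes_Stream(o)
    o.children.map { |c| accept(c) }
  end

  # An alias resolves to whatever was stored for its anchor.
  def visit_Psych_Nodes_Alias(o)
    @anchors.fetch(o.anchor)
  end

  private

  # Hypothetical helper: remember the value if the node carries an anchor.
  def register(node, value)
    @anchors[node.anchor] = value if node.anchor
    value
  end
end

yaml = <<~DOC
  base: &defaults
    type: string
  copy: *defaults
DOC

plain = ToPlainWithAliases.new.accept(Psych.parse(yaml))
p plain
```

Note that the alias resolves to the very same object as the anchored value, so mutating one changes the other. Duplicate it in visit_Psych_Nodes_Alias if that matters to you.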
You can make use of this like this:
# Could also use Psych.parse_stream or Psych.parse_file here
ast = Psych.parse(my_data)
data = ToPlain.new.accept(ast)
# data now consists of just arrays, hashes and strings
If you use this on your example data, the result is a hash that looks something like this:
{
  "attributes"=>{
    "b2NjaQ=="=>{
      "Y29yZQ=="=>{
        "aWQ="=>{
          "type"=>"string",
          "required"=>"false",
          "mutable"=>"false"
        },
        "dGl0bGU="=>{
          "type"=>"string",
          "required"=>"false",
          "mutable"=>"false"
        }
      }
    }
  }
}
Whilst the keys are a little unwieldy because you are using binary data, you can still make comparisons like this:
occi_core_id = data["attributes"]["b2NjaQ=="]["Y29yZQ=="]["aWQ="]
occi_core_title = data["attributes"]["b2NjaQ=="]["Y29yZQ=="]["dGl0bGU="]
puts occi_core_id == occi_core_title
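For whole-file comparison, note that Ruby's Hash#== already ignores key order, so two plain structures can be compared with == directly. If you also want something diff-like, here is a minimal sketch of a recursive walk over hashes, arrays and strings. The method name deep_diff and the sample data are made up for illustration. A key missing on one side shows up as nil, which is fine here because the plain structure contains no nil values unless aliases are involved.

```ruby
# Hypothetical helper: returns a list of [path, left, right] entries,
# one for every place where the two structures differ.
def deep_diff(left, right, path = [])
  if left.is_a?(Hash) && right.is_a?(Hash)
    # Union of keys, so keys missing on either side are reported too.
    (left.keys | right.keys).flat_map do |key|
      deep_diff(left[key], right[key], path + [key])
    end
  elsif left.is_a?(Array) && right.is_a?(Array)
    # Arrays are compared position by position (order matters).
    (0...[left.size, right.size].max).flat_map do |i|
      deep_diff(left[i], right[i], path + [i])
    end
  elsif left == right
    []
  else
    [[path, left, right]]
  end
end

a = { "a" => { "x" => "1", "y" => "2" }, "b" => ["p", "q"] }
b = { "b" => ["p", "r"], "a" => { "y" => "2", "x" => "1" } }

same_a = { "a" => { "y" => "2", "x" => "1" }, "b" => ["p", "q"] }
puts a == same_a # key order does not affect Hash equality

differences = deep_diff(a, b)
differences.each do |path, l, r|
  puts "#{path.join('/')}: #{l.inspect} != #{r.inspect}"
end
```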

Related

Fetch from hash with either Singular or Plural

I get the following input hash in my ruby code
my_hash = { include: 'a,b,c' }
(or)
my_hash = { includes: 'a,b,c' }
Now I want the fastest way to get 'a,b,c'
I currently use
def my_includes
  my_hash[:include] || my_hash[:includes]
end
But this is very slow because it always checks for :include keyword first then if it fails it'll look for :includes. I call this function several times and the value inside this hash can keep changing. Is there any way I can optimise and speed up this? I won't get any other keywords. I just need support for :include and :includes.
Caveats and Considerations
First, some caveats:
You tagged this Rails 3, so you're probably on a very old Ruby that doesn't support a number of optimizations, newer Hash-related method calls like #fetch_values or #transform_keys!, or pattern matching for structured data.
You can do all sorts of things with your Hash lookups, but none of them are likely to be faster than a Boolean short-circuit, assuming you can be sure of having only one key or the other at all times.
You haven't shown any of the calling code, so without benchmarks it's tough to see how this operation can be considered "slow" in any general sense.
If you're using Rails and not looking for a pure Ruby solution, you might want to consider ActiveModel::Dirty to only take action when an attribute has changed.
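If you want numbers before optimizing, a rough timing sketch along these lines may help. The elapsed helper and the iteration count are made up for illustration, and the results depend entirely on your Ruby version and machine, so treat them as an indication only.

```ruby
# Hypothetical helper: wall-clock seconds taken by the block.
def elapsed
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
end

singular = { include: 'a,b,c' }  # first lookup succeeds
plural   = { includes: 'a,b,c' } # first lookup misses, second succeeds
n = 100_000

first_hit  = elapsed { n.times { singular[:include] || singular[:includes] } }
second_hit = elapsed { n.times { plural[:include] || plural[:includes] } }

puts format("first key present:  %.4fs for %d lookups", first_hit, n)
puts format("second key present: %.4fs for %d lookups", second_hit, n)
```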
Use Memoization
Regardless of the foregoing, what you're probably missing here is some form of memoization so you don't need to constantly re-evaluate the keys and extract the values each time through whatever loop feels slow to you. For example, you could store the results of your Hash evaluation until it needs to be refreshed:
attr_accessor :includes

def extract_includes(hash)
  @includes = hash[:include] || hash[:includes]
end
You can then call #includes or #includes= (or use the @includes instance variable directly if you like) from anywhere in scope as often as you like without having to re-evaluate the hashes or keys. For example:
def count_includes
  @includes.split(',').count
end

500.times { count_includes }
The tricky part is knowing if and when to update your memoized value. You should only call #extract_includes when you fetch a new Hash from somewhere like ActiveRecord or a remote API. Until that happens, you can reuse the stored value for as long as it remains valid.
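Put together as one self-contained piece, the memoization idea might look like the sketch below. The class name IncludeList and the method name refresh are made up for illustration; the point is only that the two-key lookup happens once per new hash rather than once per use.

```ruby
# Sketch: do the two-key lookup once per new hash and reuse the result.
class IncludeList
  attr_reader :includes

  # Call this only when a new hash arrives (hypothetical method name).
  def refresh(hash)
    @includes = hash[:include] || hash[:includes]
  end

  # Cheap to call repeatedly: no hash lookup involved.
  def count_includes
    @includes.split(',').count
  end
end

list = IncludeList.new
list.refresh({ includes: 'a,b,c' })
500.times { list.count_includes }
puts list.count_includes

list.refresh({ include: 'x,y' }) # new hash fetched, so refresh once
puts list.count_includes
```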
You could work with a modified hash that has both keys :include and :includes with the same values:
my_hash = { include: 'a,b,c' }
my_hash.update(my_hash.key?(:include) ? { includes: my_hash[:include] } :
                                        { include: my_hash[:includes] })
#=> {:include=>"a,b,c", :includes=>"a,b,c"}
This may be fastest if you were using the same hash my_hash for multiple operations. If, however, a new hash is generated after just a few interrogations, you might see if both the keys :include and :includes can be included when the hash is constructed.

Is there a similar solution as Array#wrap for hashes?

I use Array.wrap(x) all the time in order to ensure that Array methods actually exist on an object before calling them.
What is the best way to similarly ensure a Hash?
Example:
def ensure_hash(x)
  # TODO: this is what I'm looking for
end

values = [nil,1,[],{},'',:a,1.0]
values.all?{|x| ensure_hash(x).respond_to?(:keys) } # true
The best I've been able to come up with so far is:
Hash::try_convert(x) || {}
However, I would prefer something more elegant.
tl;dr: In an app with proper error handling, there is no "easy, care-free" way to handle something that may or may not be hashy.
From a conceptual standpoint, the answer is no. There is no similar solution as Array.wrap(x) for hashes.
An array is a collection of values. Single values can be stored outside of arrays (e.g. x = 42) , so it's a straight-forward task to wrap a value in an array (a = [42]).
A hash is a collection of key-value pairs. In ruby, single key-value pairs can't exist outside of a hash. The only way to express a key-value pair is with a hash: h = { v: 42 }
Of course, there are a thousand ways to express a key-value pair as a single value. You could use an array [k, v], a delimited string "k:v", or some more obscure method.
But at that point, you're no longer wrapping, you're parsing. Parsing relies on properly formatted data and has multiple points of failure. No matter how you look at it, if you find yourself in a situation where you may or may not have a hash, that means you need to write a proper chunk of code for data validation and parsing (or refactor your upstream code so that you can always expect a hash).
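If silently falling back to an empty hash is acceptable in your case, the Hash.try_convert approach from the question can be wrapped as below. This is only a sketch of that fallback, not a real equivalent of Array.wrap: anything that does not respond to to_hash is discarded rather than wrapped.

```ruby
# Sketch: return the argument as a Hash if it converts via to_hash,
# otherwise fall back to an empty Hash (the original value is dropped).
def ensure_hash(x)
  Hash.try_convert(x) || {}
end

values = [nil, 1, [], {}, '', :a, 1.0, { a: 1 }]
results = values.map { |x| ensure_hash(x) }

puts results.all? { |h| h.respond_to?(:keys) }
p ensure_hash({ a: 1 }) # real hashes pass through unchanged
p ensure_hash(42)       # non-hashes become {}
```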

Ruby hashes, duck typing and JSON deserialization

I am pretty new to Ruby and currently discovering its differences from Java, consider the following code snippet:
require 'json'

file = File.new('test.json', 'w')
hash = {}
hash['1234'] = 'onetwothreefour_str'
hash[1234] = 'onetwothreefour_num'
puts hash.to_json
file.write(hash.to_json)
file.close
str = File.read('test.json')
puts str
puts JSON.parse(str)
it outputs
{"1234":"onetwothreefour_str","1234":"onetwothreefour_num"}
{"1234":"onetwothreefour_str","1234":"onetwothreefour_num"}
{"1234"=>"onetwothreefour_num"}
so, after deserialization we have one less object in hash.
Now, the question - is it normal behaviour? I think that it is perfectly legal to store in hash keys of different types. If so, then shouldn't JSON.parse write to file keys as '1234' and 1234?
Just to be clear - I understand that it's better to have keys of the same type, I just saw that after restoring my object has them as strings instead of numbers.
Yes, ruby hashes can have keys of whatever type.
JSON spec, on the other hand, dictates that object keys must be strings, no other type allowed.
So that explains the output you observe: upon serializing, integer key is turned into a string, making it a duplicate of another key. When reading it back, duplicate keys are dropped (last one wins, IIRC). I'm pretty sure you would get the same behaviour if you tried to use that json from javascript.
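The key conversion is easy to see in isolation. The sketch below round-trips a hash with a single integer key (sample data made up for illustration), which avoids the duplicate-key situation and shows only the type change.

```ruby
require 'json'

original = { 1234 => 'onetwothreefour_num' }

json = original.to_json    # the integer key is written as a JSON string
restored = JSON.parse(json)

puts json
p restored
p restored.key?(1234)      # the integer key is gone after the round trip
p restored.key?('1234')    # it came back as a string
```

If you need integer keys back, you have to convert them yourself after parsing, since JSON has no way to record the original key type.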

Intersection of multiple hashes in Ruby

The scenario is like so: I have a directory which has multiple JSON files which have similar data, but not exactly the same data (the structure is the same, but the data may not necessarily be the same).
I need to find the keys which are similar between all the JSON files (i.e the intersection of all the JSON files).
I load the JSON files like so
require 'json'
ARGV.each {|x|
  JSON.parse(File.read(x))
}
From here, I don't know how to get the intersection of the hashes.
I know you can use sets, like so
require 'json'
require 'set'
ARGV.each {|x|
  JSON.parse(File.read(x)).to_set
}.reduce(:&)
But as per this post, Hashes Vs. Set Performace, hashes seem to be faster (although I guess it depends on the use case).
So, how can I find the intersection of multiple hashes (where the key value pairs are the same), without using Set?
You don't need to use a set. A set maintains that all elements are unique, and a JSON hash will never have two identical key-value pairs. I'd just use a normal array (to_a).
One problem is that you are actually calling reduce(:&) on ARGV, instead of the parsed JSON. You can change each to map to fix this:
ARGV.map { |x|
  JSON.parse(File.read(x)).to_a
}.reduce(:&)
If you want to convert this back into a hash form, you can use to_h.
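Here is the same approach as a self-contained sketch, using inline JSON strings in place of the files from ARGV (the sample documents are made up for illustration). Only key-value pairs that appear in every document survive.

```ruby
require 'json'

# Stand-ins for the contents of the JSON files.
documents = [
  '{"a": 1, "b": 2, "c": 3}',
  '{"a": 1, "b": 5, "c": 3}',
  '{"c": 3, "a": 1}'
]

common = documents
  .map { |doc| JSON.parse(doc).to_a } # each hash becomes [[key, value], ...]
  .reduce(:&)                         # keep pairs present in every array
  .to_h                               # back to a hash

p common
```

Note that this compares whole pairs, so a key with different values in two files ("b" above) is dropped. Nested values are compared with eql?, so an integer 1 and a float 1.0 would not count as the same value.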

Why is :key.hash != 'key'.hash in Ruby?

I'm learning Ruby right now for the Rhodes mobile application framework and came across this problem: Rhodes' HTTP client parses JSON responses into Ruby data structures, e.g.
puts @params # prints {"body"=>{"results"=>[]}}
Since the key "body" is a string here, my first attempt @params[:body] failed (is nil) and instead it must be @params['body']. I find this most unfortunate.
Can somebody explain the rationale why strings and symbols have different hashes, i.e. :body.hash != 'body'.hash in this case?
Symbols and strings serve two different purposes.
Strings are your good old familiar friends: mutable and garbage-collectable. Every time you use a string literal or #to_s method, a new string is created. You use strings to build HTML markup, output text to screen and whatnot.
Symbols, on the other hand, are different. Each symbol exists in only one instance, and historically it was never garbage-collected (since Ruby 2.2, dynamically created symbols can be collected). Because of that you should still create new symbols carefully (String#to_sym and :'' literals). These properties make them a good candidate for naming things. For example, it's idiomatic to use symbols in macros like attr_reader :foo.
If you got your hash from an external source (you deserialized a JSON response, for example) and you want to use symbols to access its elements, then you can either use HashWithIndifferentAccess (as others pointed out), or call helper methods from ActiveSupport:
require 'active_support/core_ext'
h = {"body"=>{"results"=>[]}}
h.symbolize_keys # => {:body=>{"results"=>[]}}
h.stringify_keys # => {"body"=>{"results"=>[]}}
Note that it'll only touch top level and will not go into child hashes.
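If you need the nested hashes converted as well and don't want to pull in ActiveSupport, a small recursive helper in plain Ruby can do it. This is a sketch; the method name deep_symbolize is made up for illustration. When the data comes from JSON, the standard library's JSON.parse(str, symbolize_names: true) is another option.

```ruby
require 'json'

# Hypothetical helper: symbolize keys at every nesting level.
def deep_symbolize(value)
  case value
  when Hash
    value.each_with_object({}) do |(key, inner), out|
      # Leave keys alone if they cannot be turned into symbols.
      new_key = key.respond_to?(:to_sym) ? key.to_sym : key
      out[new_key] = deep_symbolize(inner)
    end
  when Array
    value.map { |item| deep_symbolize(item) }
  else
    value
  end
end

h = { "body" => { "results" => [{ "id" => "1" }] } }
symbolized = deep_symbolize(h)
p symbolized
p symbolized[:body][:results].first[:id]

# Same effect directly at parse time for JSON input:
parsed = JSON.parse('{"body": {"results": []}}', symbolize_names: true)
p parsed
```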
Symbols and Strings are never ==:
:foo == 'foo' # => false
That's a (very reasonable) design decision. After all, they have different classes, methods, one is mutable the other isn't, etc...
Because of that, it is mandatory that they are never eql?:
:foo.eql? 'foo' # => false
Two objects that are not eql? typically don't have the same hash, but even if they did, the Hash lookup uses hash and then eql?. So your question really was "why are symbols and strings not eql?".
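To make the "hash, then eql?" lookup concrete, here is a small sketch with a made-up key class (LooseKey is purely illustrative) that treats a string and a symbol with the same name as the same key. It defines exactly the two methods Hash relies on.

```ruby
# Illustrative key class: two keys match if their names match as strings.
class LooseKey
  attr_reader :name

  def initialize(name)
    @name = name.to_s
  end

  # Step 1 of a Hash lookup: pick the bucket.
  def hash
    name.hash
  end

  # Step 2 of a Hash lookup: confirm the candidate really is the same key.
  def eql?(other)
    other.is_a?(LooseKey) && name == other.name
  end
end

plain = { "body" => 1 }
p plain[:body]   # symbol and string are not eql?, so nothing is found
p plain["body"]

loose = { LooseKey.new("body") => 1 }
p loose[LooseKey.new(:body)] # found: same hash value and eql? returns true
```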
Rails uses HashWithIndifferentAccess that accesses indifferently with strings or symbols.
In Rails, the params hash is actually a HashWithIndifferentAccess rather than a standard ruby Hash object. This allows you to use either strings like 'action' or symbols like :action to access the contents.
You will get the same results regardless of what you use, but keep in mind this only works on HashWithIndifferentAccess objects.
Copied from : Params hash keys as symbols vs strings
