The scenario is like so: I have a directory containing multiple JSON files with similar data; the structure is the same, but the data may not necessarily be the same.
I need to find the keys that are shared between all the JSON files (i.e. the intersection of all the JSON files).
I load the JSON files like so
require 'json'
ARGV.each { |x|
  JSON.parse(File.read(x))
}
From here, I don't know how to get the intersection of the hashes.
I know you can use sets, like so
require 'json'
require 'set'
ARGV.each { |x|
  JSON.parse(File.read(x)).to_set
}.reduce(:&)
But as per this post, Hashes Vs. Set Performance, hashes seem to be faster (although I guess it depends on the use case).
So, how can I find the intersection of multiple hashes (where the key value pairs are the same), without using Set?
You don't need to use a set. A set maintains that all elements are unique, and a JSON hash will never have two identical key-value pairs. I'd just use a normal array (to_a).
One problem is that you are actually calling reduce(:&) on ARGV, instead of the parsed JSON. You can change each to map to fix this:
ARGV.map { |x|
  JSON.parse(File.read(x)).to_a
}.reduce(:&)
If you want to convert this back into a hash form, you can use to_h.
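Putting the answer together, here is a minimal, self-contained sketch; the inline docs array stands in for the hashes parsed from the ARGV files:

```ruby
require 'json'

# Stand-ins for JSON.parse(File.read(x)) results; sample data assumed
docs = [
  JSON.parse('{"a": 1, "b": 2, "c": 3}'),
  JSON.parse('{"a": 1, "b": 9, "c": 3}'),
  JSON.parse('{"a": 1, "c": 3, "d": 4}')
]

# Intersect the key-value pairs across all hashes, then rebuild a hash
common = docs.map(&:to_a).reduce(:&).to_h
# => {"a"=>1, "c"=>3}
```

Only pairs whose key and value both match in every file survive the intersection, which is what "where the key value pairs are the same" asks for.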
I get the following input hash in my ruby code
my_hash = { include: 'a,b,c' }
(or)
my_hash = { includes: 'a,b,c' }
Now I want the fastest way to get 'a,b,c'
I currently use
def my_includes
  my_hash[:include] || my_hash[:includes]
end
But this is very slow because it always checks for :include first, and only if that fails does it look for :includes. I call this function several times, and the value inside this hash can keep changing. Is there any way I can optimise and speed this up? I won't get any other keywords; I just need support for :include and :includes.
Caveats and Considerations
First, some caveats:
You tagged this Rails 3, so you're probably on a very old Ruby that doesn't support a number of optimizations, newer Hash-related method calls like #fetch_values or #transform_keys!, or pattern matching for structured data.
You can do all sorts of things with your Hash lookups, but none of them are likely to be faster than a Boolean short-circuit, assuming you can be sure of having only one key or the other at all times.
You haven't shown any of the calling code, so without benchmarks it's tough to see how this operation can be considered "slow" in any general sense.
If you're using Rails and not looking for a pure Ruby solution, you might want to consider ActiveModel::Dirty to only take action when an attribute has changed.
Use Memoization
Regardless of the foregoing, what you're probably missing here is some form of memoization so you don't need to constantly re-evaluate the keys and extract the values each time through whatever loop feels slow to you. For example, you could store the results of your Hash evaluation until it needs to be refreshed:
attr_accessor :includes

def extract_includes(hash)
  @includes = hash[:include] || hash[:includes]
end
You can then call #includes or #includes= (or use the @includes instance variable directly if you like) from anywhere in scope as often as you like, without having to re-evaluate the hashes or keys. For example:
def count_includes
  @includes.split(',').count
end
500.times { count_includes }
The tricky part is knowing if and when to update your memoized value. Basically, you should only call #extract_includes when you fetch a new Hash from somewhere like ActiveRecord or a remote API; until that happens, you can reuse the stored value for as long as it remains valid.
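As a sketch of that idea, assuming the hash only changes when you explicitly fetch a new one (IncludesCache is a hypothetical name, not from the question):

```ruby
# Hypothetical wrapper: re-evaluate the keys only when a new hash arrives
class IncludesCache
  attr_reader :includes

  def refresh(hash)
    @includes = hash[:include] || hash[:includes]
  end
end

cache = IncludesCache.new
cache.refresh(include: 'a,b,c')
cache.includes  # => "a,b,c"  (memoized; no hash lookups on later reads)
cache.refresh(includes: 'x,y')
cache.includes  # => "x,y"
```

Every read between two #refresh calls is a plain attribute read, so the || short-circuit runs once per new hash rather than once per call.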
You could work with a modified hash that has both keys :include and :includes with the same values:
my_hash = { include: 'a,b,c' }
my_hash.update(my_hash.key?(:include) ? { includes: my_hash[:include] } :
               { include: my_hash[:includes] })
#=> {:include=>"a,b,c", :includes=>"a,b,c"}
This may be fastest if you were using the same hash my_hash for multiple operations. If, however, a new hash is generated after just a few interrogations, you might see if both the keys :include and :includes can be included when the hash is constructed.
I use Array.wrap(x) all the time in order to ensure that Array methods actually exist on an object before calling them.
What is the best way to similarly ensure a Hash?
Example:
def ensure_hash(x)
  # TODO: this is what I'm looking for
end
values = [nil,1,[],{},'',:a,1.0]
values.all?{|x| ensure_hash(x).respond_to?(:keys) } # true
The best I've been able to come up with so far is:
Hash::try_convert(x) || {}
However, I would prefer something more elegant.
tl;dr: In an app with proper error handling, there is no "easy, care-free" way to handle something that may or may not be hashy.
From a conceptual standpoint, the answer is no. There is no similar solution as Array.wrap(x) for hashes.
An array is a collection of values. Single values can be stored outside of arrays (e.g. x = 42), so it's a straightforward task to wrap a value in an array (a = [42]).
A hash is a collection of key-value pairs. In Ruby, single key-value pairs can't exist outside of a hash. The only way to express a key-value pair is with a hash: h = { v: 42 }
Of course, there are a thousand ways to express a key-value pair as a single value. You could use an array [k, v], a delimited string "k:v", or some more obscure method.
But at that point, you're no longer wrapping, you're parsing. Parsing relies on properly formatted data and has multiple points of failure. No matter how you look at it, if you find yourself in a situation where you may or may not have a hash, that means you need to write a proper chunk of code for data validation and parsing (or refactor your upstream code so that you can always expect a hash).
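For completeness, the Hash::try_convert fallback from the question can be wrapped into a method. Note that it silently turns every non-hash into {}, which is exactly the care-free behaviour the answer above warns against:

```ruby
# Sketch of the try_convert fallback; any non-hash silently becomes {}
def ensure_hash(x)
  Hash.try_convert(x) || {}
end

values = [nil, 1, [], {}, '', :a, 1.0]
values.all? { |x| ensure_hash(x).respond_to?(:keys) } # => true
```

Hash.try_convert only succeeds for objects implementing #to_hash, so every other input, valid or garbage, collapses to an empty hash with no error raised.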
I am pretty new to Ruby and currently discovering its differences from Java, consider the following code snippet:
file = File.new('test.json', 'w')
hash = {}
hash['1234'] = 'onetwothreefour_str'
hash[1234] = 'onetwothreefour_num'
puts hash.to_json
file.write(hash.to_json)
file.close
str = File.read('test.json')
puts str
puts JSON.parse(str)
it outputs
{"1234":"onetwothreefour_str","1234":"onetwothreefour_num"}
{"1234":"onetwothreefour_str","1234":"onetwothreefour_num"}
{"1234"=>"onetwothreefour_num"}
so, after deserialization we have one less entry in the hash.
Now, the question: is this normal behaviour? I think it is perfectly legal to store keys of different types in a hash. If so, then shouldn't to_json write the keys to the file as '1234' and 1234?
Just to be clear: I understand that it's better to have keys of the same type; I just saw that after restoring, my object has them as strings instead of numbers.
Yes, ruby hashes can have keys of whatever type.
JSON spec, on the other hand, dictates that object keys must be strings, no other type allowed.
So that explains the output you observe: upon serializing, the integer key is turned into a string, making it a duplicate of another key. When reading it back, duplicate keys are dropped (the last one wins, IIRC). I'm pretty sure you would get the same behaviour if you tried to use that JSON from JavaScript.
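You can reproduce the same last-one-wins collapse in pure Ruby by stringifying the keys yourself, before any JSON is involved:

```ruby
require 'json'

hash = { '1234' => 'onetwothreefour_str', 1234 => 'onetwothreefour_num' }

# Stringifying keys merges the two entries; the later one wins
collapsed = hash.map { |k, v| [k.to_s, v] }.to_h
# => {"1234"=>"onetwothreefour_num"}

# The JSON round trip gives the same result
JSON.parse(hash.to_json) == collapsed # => true
```

Doing the conversion explicitly up front at least makes the collision visible in your own code, rather than discovering it after a round trip through a file.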
I am looking for a way to compare (and, best case, also diff) two YAML files in Ruby; regardless of key order, naturally. So far all solutions I found depended on loading the files with YAML::load_file(). I cannot do that, however, because the files are dumps of Ruby objects whose class declarations I do not have, so that loading them throws undefined class/module.
I think I need to load them as string hashes and compare that, but how do I tell Ruby to ignore the type information and just include it into the comparison?
Based on comments: I'm basically interested in text-based comparison, but it must be aware of the "depth" of the data structure. For instance this is an excerpt from one of the files I have:
attributes: !ruby/hash:Example::Attributes
  !binary "b2NjaQ==": !ruby/hash:Example::Attributes
    !binary "Y29yZQ==": !ruby/hash:Example::Attributes
      !binary "aWQ=": !ruby/object:Example::Properties
        type: string
        required: false
        mutable: false
      !binary "dGl0bGU=": !ruby/object:Example::Properties
        type: string
        required: false
        mutable: false
So the comparison must be able to identify a match even if the two attributes are in reverse order.
Psych, Ruby’s Yaml parser, provides several ways to examine Yaml data. The highest level loads the Yaml and provides a Ruby data structure. This is the API that looks at the Yaml tags and tries to load the appropriate Ruby classes, which is causing your problems. It also looks at the format of the data and converts it to various types (e.g. Dates) if it matches.
The next level will parse the Yaml and provide you with an AST containing the “raw” Yaml data. The high level api works by first parsing to this AST and then traversing it using the visitor pattern to create Ruby data (normally a Hash or Array). Unfortunately it doesn’t provide anything in between these two levels, but it is fairly easy to create a parser that creates a simplified data structure.
At its core Yaml data basically consists of scalars (which are basically strings), mappings (hashes) and sequences (arrays) – all of which can have a tag associated with them. The AST provided by Psych consists of these three types (and a couple of others), and we can create our own visitor that traverses it and produces a Ruby structure that consists solely of hashes, arrays and strings.
This is loosely based on the Psych ToRuby visitor class, but instead of trying to convert the data to the appropriate Ruby type it only creates arrays, hashes and strings, throwing away any data in tags:
require 'psych'
class ToPlain < Psych::Visitors::Visitor
  # Scalars are just strings.
  def visit_Psych_Nodes_Scalar o
    o.value
  end

  # Sequences are arrays.
  def visit_Psych_Nodes_Sequence o
    o.children.each_with_object([]) do |child, list|
      list << accept(child)
    end
  end

  # Mappings are hashes.
  def visit_Psych_Nodes_Mapping o
    o.children.each_slice(2).each_with_object({}) do |(k, v), h|
      h[accept(k)] = accept(v)
    end
  end

  # We also need to handle documents...
  def visit_Psych_Nodes_Document o
    accept o.root
  end

  # ... and streams.
  def visit_Psych_Nodes_Stream o
    o.children.map { |c| accept c }
  end

  # Aliases aren't handled here :-(
  def visit_Psych_Nodes_Alias o
    # Not implemented!
  end
end
(Note this doesn’t handle aliases. It’s not too difficult to add support for them, have a look at what ToRuby does, in particular the register method and how it’s used.)
You can make use of this like this:
# Could also use parse_stream or parse_file here
ast = YAML.parse(my_data)
data = ToPlain.new.accept(ast)
# data now consists of just arrays, hashes and strings
If you use this on your example data, the result is a hash that looks something like this:
{
  "attributes"=>{
    "b2NjaQ=="=>{
      "Y29yZQ=="=>{
        "aWQ="=>{
          "type"=>"string",
          "required"=>"false",
          "mutable"=>"false"
        },
        "dGl0bGU="=>{
          "type"=>"string",
          "required"=>"false",
          "mutable"=>"false"
        }
      }
    }
  }
}
Whilst the keys are a little unwieldy because you are using binary data, you can still make comparisons like this:
occi_core_id = data["attributes"]["b2NjaQ=="]["Y29yZQ=="]["aWQ="]
occi_core_title = data["attributes"]["b2NjaQ=="]["Y29yZQ=="]["dGl0bGU="]
puts occi_core_id == occi_core_title
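If you want the comparison output to be human-readable, the !binary keys are just Base64-encoded strings, so you can decode them with the standard library (the decoded names below follow from the excerpt above):

```ruby
require 'base64'

# The !binary YAML keys are Base64; decoding recovers the original names
Base64.decode64("b2NjaQ==")  # => "occi"
Base64.decode64("Y29yZQ==")  # => "core"
Base64.decode64("aWQ=")      # => "id"
Base64.decode64("dGl0bGU=")  # => "title"
```

Decoding the keys while building the hash (or just before diffing) would make any mismatch report far easier to read.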
I am looking to make an efficient function to clear out a redis-based cache.
I have a method call that returns a number of keys from redis:
$redis.keys("foo:*")
That returns all the keys that start with "foo:". Next, I'd like to delete all the values for these keys.
One (memory-intensive) way to do this is:
$redis.keys("foo:*").each do |key|
  $redis.del(key)
end
I'd like to avoid loading all the keys into memory, and then making numerous requests to the redis server.
Another way that I like is to use the splat operator:
keys = $redis.keys("foo:*")
$redis.del(*keys)
The problem is that I don't know the maximum arity of the $redis.del method, or of Ruby methods in general, and I can't seem to find it online.
What is the maximum arity?
@muistooshort in the comments had a good suggestion that turned out to be right: the Redis driver knows what to do with an array argument:
# there are 1,000,000 keys of the form "foo:#{number}"
keys = $redis.keys("foo:*")
$redis.del(keys) # => 1000000
Simply pass an array of keys to $redis.del.
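As a side note, KEYS blocks the Redis server while it walks the whole keyspace; the redis-rb gem also exposes SCAN via scan_each, which lets you delete in batches without that pause. A sketch, assuming the redis-rb client API (clear_cache and the batch size of 500 are illustrative names, not from the answer):

```ruby
# Sketch assuming redis-rb's scan_each / del: walk matching keys with
# SCAN and delete them in batches, returning the number deleted.
def clear_cache(redis, pattern, batch_size: 500)
  deleted = 0
  redis.scan_each(match: pattern).each_slice(batch_size) do |batch|
    deleted += redis.del(batch)
  end
  deleted
end

# clear_cache($redis, "foo:*")
```

Passing the client in keeps the method testable, and each_slice bounds both client memory and the size of any single DEL command.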