JSON generate unique hash value (SHA-512) - ruby

I'm searching for a way to generate a SHA-512 hash from a json string in Ruby, independent from the positions of the elements in it, and independent from nestings, arrays, nested arrays and so on. I just want to hash the raw data along with its keys.
I tried some approaches with converting the JSON into a ruby hash, deep sort them by their keys, append everything into one, long string and hash it. But I bet that my solution isn't the most efficient one, and that there must be a better way to do this.
EDIT
So far, I convert JSON into a Ruby hash. Then I try to use this function to get a canonical representation:
def self.canonical_string_from_hash value, key=nil
str = ""
if value.is_a? Hash
value.keys.sort.each do |k|
str += canonical_string_from_hash(value[k], k)
end
elsif value.is_a? Array
str += key.to_s
value.each do |v|
str += canonical_string_from_hash(v)
end
else
str += key ? "#{key}#{value}" : value.to_s
end
return str
end
But I'm not sure, if this is a good and efficient way to do this.
For example, this hash
hash = {
id: 3,
zoo: "test",
global: [
{ukulele: "ringding", blub: 3},
{blub: nil, ukulele: "rangdang", guitar: "stringstring"}
],
foo: {
ids: [3,4,5],
bar: "asdf"
}
}
gets converted to this string:
barasdfids345globalblub3ukuleleringdingblubguitarstringstringukulelerangdangid3zootest

But I'm not sure, if this is a good and efficient way to do this.
Depends on what you are trying to do. Your canonical/equivalent structures need to represent what is important to you for the comparison. Removing details such as object structure makes sense if you consider two items with different structure but same string values equivalent.
According to your comments, you are attempting to sign a request that is being transferred from one system to a second one. In other words you want security, not a measure of similarity or a digital fingerprint for some other purpose. Therefore equivalent requests are ones that are identical in all the ways that affect the processing that you want to protect. It is simpler, and very likely more secure, to lock down the raw bytes of data that transfer between your two systems.
In which case your whole approach needs a re-think. The reasons for that are probably best discussed on security.stackoverflow.com
However, in brief:
Use an HMAC routine (HMAC-SHA512), it is designed for your purpose. Instead of a salt, this uses a secret, which is essentially the same thing (in fact you need to keep your salt a secret in your implementation too, which is unusual for something called a salt), but has been combined with the SHA in a way which makes it resilient to a couple of attack forms possible against simple concatenation followed by SHA. The worst of these is that it is possible to extend the data and have it generate the same SHA when processed, without needing to know the salt. In other words, an attacker could take a known valid request and use it to forge other requests which will get past your security check. Your proposed solution looks vulnerable to this form of attack to me.
Unpacking the request and analysing the details to get a "canonical" view of the request is not necessary, and also reduces the security of your solution. The only reason for doing this is that you are for some reason not able to handle the request once it has been serialised to JSON, and are forced to work only with the de-serialised request at one end or another of the two systems. If that is purely a knowledge or convenience thing, then fix that problem rather than trying to roll your own security protocol using SHA-512.
You should sign the request, and check the signature, against the fully serialised JSON string. If you need to de-serialise data from a "man-in-the-middle" attack, then you are potentially already exposed to some attacks via the parser. You should work to reject suspect requests before any data processing has been done to them.
TL;DR - ALthough not a direct answer to your question, the correct solution for you is to not write this code at all. Instead you need to place your secure signature code closer to the ins and outs of your two services that need to trust each other.

Related

Fetch from hash with either Singular or Plural

I get the following input hash in my ruby code
my_hash = { include: 'a,b,c' }
(or)
my_hash = { includes: 'a,b,c' }
Now I want the fastest way to get 'a,b,c'
I currently use
def my_includes
my_hash[:include] || my_hash[:includes]
end
But this is very slow because it always checks for :include keyword first then if it fails it'll look for :includes. I call this function several times and the value inside this hash can keep changing. Is there any way I can optimise and speed up this? I won't get any other keywords. I just need support for :include and :includes.
Caveats and Considerations
First, some caveats:
You tagged this Rails 3, so you're probably on a very old Ruby that doesn't support a number of optimizations, newer Hash-related method calls like #fetch_values or #transform_keys!, or pattern matching for structured data.
You can do all sorts of things with your Hash lookups, but none of them are likely to be faster than a Boolean short-circuit when assuming you can be sure of having only one key or the other at all times.
You haven't shown any of the calling code, so without benchmarks it's tough to see how this operation can be considered "slow" in any general sense.
If you're using Rails and not looking for a pure Ruby solution, you might want to consider ActiveModel::Dirty to only take action when an attribute has changed.
Use Memoization
Regardless of the foregoing, what you're probably missing here is some form of memoization so you don't need to constantly re-evaluate the keys and extract the values each time through whatever loop feels slow to you. For example, you could store the results of your Hash evaluation until it needs to be refreshed:
attr_accessor :includes
def extract_includes(hash)
#includes = hash[:include] || hash[:includes]
end
You can then call #includes or #includes= (or use the #includes instance variable directly if you like) from anywhere in scope as often as you like without having to re-evaluate the hashes or keys. For example:
def count_includes
#includes.split(?,).count
end
500.times { count_includes }
The tricky part is basically knowing if and when to update your memoized value. Basically, you should only call #extract_includes when you fetch a new Hash from somewhere like ActiveRecord or a remote API. Until that happens, you can reuse the stored value for as long as it remains valid.
You could work with a modified hash that has both keys :include and :includes with the same values:
my_hash = { include: 'a,b,c' }
my_hash.update(my_hash.key?(:include) ? { includes: my_hash[:include] } :
{ include: my_hash[:includes] })
#=> {:include=>"a,b,c", :includes=>"a,b,c"}
This may be fastest if you were using the same hash my_hash for multiple operations. If, however, a new hash is generated after just a few interrogations, you might see if both the keys :include and :includes can be included when the hash is constructed.

Is there a similar solution as Array#wrap for hashes?

I use Array.wrap(x) all the time in order to ensure that Array methods actually exist on an object before calling them.
What is the best way to similarly ensure a Hash?
Example:
def ensure_hash(x)
# TODO: this is what I'm looking for
end
values = [nil,1,[],{},'',:a,1.0]
values.all?{|x| ensure_hash(x).respond_to?(:keys) } # true
The best I've been able to come up with so far is:
Hash::try_convert(x) || {}
However, I would prefer something more elegant.
tl; dr: In an app with proper error handling, there is no "easy, care-free" way to handle something that may or may not be hashy.
From a conceptual standpoint, the answer is no. There is no similar solution as Array.wrap(x) for hashes.
An array is a collection of values. Single values can be stored outside of arrays (e.g. x = 42) , so it's a straight-forward task to wrap a value in an array (a = [42]).
A hash is a collection of key-value pairs. In ruby, single key-value pairs can't exist outside of a hash. The only way to express a key-value pair is with a hash: h = { v: 42 }
Of course, there are a thousand ways to express a key-value pair as a single value. You could use an array [k, v] or a delimited string `"k:v" or some more obscure method.
But at that point, you're no longer wrapping, you're parsing. Parsing relies on properly formatted data and has multiple points of failure. No matter how you look at it, if you find yourself in a situation where you may or may not have a hash, that means you need to write a proper chunk of code for data validation and parsing (or refactor your upstream code so that you can always expect a hash).

Ruby hashes, duck typing and JSON deserialization

I am pretty new to Ruby and currently discovering its differences from Java, consider the following code snippet:
file = File.new('test.json', 'w')
hash = {}
hash['1234'] = 'onetwothreefour_str'
hash[1234] = 'onetwothreefour_num'
puts hash.to_json
file.write(hash.to_json)
file.close
str = File.read('test.json')
puts str
puts JSON.parse(str)
it outputs
{"1234":"onetwothreefour_str","1234":"onetwothreefour_num"}
{"1234":"onetwothreefour_str","1234":"onetwothreefour_num"}
{"1234"=>"onetwothreefour_num"}
so, after deserialization we have one less object in hash.
Now, the question - is it normal behaviour? I think that it is perfectly legal to store in hash keys of different types. If so, then shouldn't JSON.parse write to file keys as '1234' and 1234?
Just to be clear - I understand that it's better to have keys of the same type, I just saw that after restoring my object has them as strings instead of numbers.
Yes, ruby hashes can have keys of whatever type.
JSON spec, on the other hand, dictates that object keys must be strings, no other type allowed.
So that explains the output you observe: upon serializing, integer key is turned into a string, making it a duplicate of another key. When reading it back, duplicate keys are dropped (last one wins, IIRC). I'm pretty sure you would get the same behaviour if you tried to use that json from javascript.

Maximum arity of ruby function?

I am looking to make an efficient function to clear out a redis-based cache.
I have a method call that returns a number of keys from redis:
$redis.keys("foo:*")
That returns all the keys that start with "foo:". Next, I'd like to delete all the values for these keys.
One (memory-intensive) way to do this is:
$redis.keys("foo:*").each do |key|
$redis.del(key)
end
I'd like to avoid loading all the keys into memory, and then making numerous requests to the redis server.
Another way that I like is to use the splat operator:
keys = $redis.keys("foo:*")
$redis.del(*keys)
The problem is that I don't know what the maximum arity of the $redis.del method, nor of any ruby method, I can't seem to find it online.
What is the maximum arity?
#muistooshort in the comments had a good suggestion that turned out to be right, the redis driver knows what to do with an array argument:
# there are 1,000,000 keys of the form "foo:#{number}"
keys = $redis.keys("foo:*")
$redis.del(keys) # => 1000000
Simply pass an array of keys to $redis.del

What's the best way to hash a url in ruby?

I'm writing a web app that points to external links. I'm looking to create a non-sequential, non-guessable id for each document that I can use in the URL. I did the obvious thing: treating the url as a string and str#crypt on it, but that seems to choke on any non-alphanumberic characters, like the slashes, dots and underscores.
Any suggestions on the best way to solve this problem?
Thanks!
Depending on how long a string you would like you can use a few alternatives:
require 'digest'
Digest.hexencode('http://foo-bar.com/yay/?foo=bar&a=22')
# "687474703a2f2f666f6f2d6261722e636f6d2f7961792f3f666f6f3d62617226613d3232"
require 'digest/md5'
Digest::MD5.hexdigest('http://foo-bar.com/yay/?foo=bar&a=22')
# "43facc5eb5ce09fd41a6b55dba3fe2fe"
require 'digest/sha1'
Digest::SHA1.hexdigest('http://foo-bar.com/yay/?foo=bar&a=22')
# "2aba83b05dc9c2d9db7e5d34e69787d0a5e28fc5"
require 'digest/sha2'
Digest::SHA2.hexdigest('http://foo-bar.com/yay/?foo=bar&a=22')
# "e78f3d17c1c0f8d8c4f6bd91f175287516ecf78a4027d627ebcacfca822574b2"
Note that this won't be unguessable, you may have to combine it with some other (secret but static) data to salt the string:
salt = 'foobar'
Digest::SHA1.hexdigest(salt + 'http://foo-bar.com/yay/?foo=bar&a=22')
# "dbf43aff5e808ae471aa1893c6ec992088219bbb"
Now it becomes much harder to generate this hash for someone who doesn't know the original content and has no access to your source.
I would also suggest looking at the different algorithms in the digest namespace. To make it harder to guess, rather than (or in addition to) salting with a secret passphrase, you can also use a precise dump of the time:
require 'digest/md5'
def hash_url(url)
Digest::MD5.hexdigest("#{Time.now.to_f}--#{url}")
end
Since the result of any hashing algorithm is not guaranteed to be unique, don't forget to check for the uniqueness of your result against previously generated hashes before assuming that your hash is usable. The use of Time.now makes the retry trivial to implement, since you only have to call until a unique hash is generated.
Use Digest::MD5 from Ruby's standard library:
Digest::MD5.hexdigest(my_url)

Resources