What to do on a Ruby Hash, to optimize/fasten lookup ? (read only access)
Ex: freezing the hash, sorting the keys, forcing numerical keys...
Use frozen_string_literal:
frozen_string_literal reduces the generated garbage by ~100MB or ~20%! Free performance by adding a one line comment.
Conclusion
Gem authors: add # frozen_string_literal: true to the top of all Ruby files in a gem. It gives a free performance improvement to all your users as long as you don’t use String mutation.
Mike Perham: Ruby Optimization with One Magic Comment, 2018-02-28: https://www.mikeperham.com/2018/02/28/ruby-optimization-with-one-magic-comment/
Use Symbol as hash keys:
If you use Ruby 2.2, Symbol could be more performant than String as Hash keys.
Fast Ruby: https://github.com/fastruby/fast-ruby#hash
Related
I'm doing some update on other one's code and now I have a hash, it's like:
{"instance_id"=>"74563c459c457b2288568ec0a7779f62", "mem_quota"=>536870912, "disk_quota"=>2147483648, "mem_usage"=>59164.0, "cpu_usage"=>0.1, "disk_usage"=>6336512}
and I want to get the value by symbol as a key, for example: :mem_quota, but failed.
The code is like:
instance[:mem_usage].to_f
but it returns nothing. Is there any reason can cause this problem?
Use instance["mem_usage"] instead since the hash is not using symbols.
The other explanations are correct, but to give a broader background:
You are probably used to working within Rails where a very specific variant of Hash, called HashWithIndifferentAccess, is used for things like params. This particular class works like a standard ruby Hash, except when you access keys you are allowed to use either Symbols or Strings. The standard Ruby Hash, and generally speaking, Hash implementations in other languages, expect that to access an element, the key used for later access should be an object of the same class and value as the key used to store the object. HashWithIndifferentAccess is a Rails convenience class provided via the Active Support libraries. You are free to use them yourself, but they have first be brought in by requiring them.
HashWithIndifferentAccess just does the conversion for you at access time from string to symbol.
So, for your case, instance["mem_usage"].to_f should work.
You need HashWithIndifferentAccess.
require 'active_support/core_ext'
h1 = {"instance_id"=>"74563c459c457b2288568ec0a7779f62", "mem_quota"=>536870912,
"disk_quota"=>2147483648, "mem_usage"=>59164.0, "cpu_usage"=>0.1,
"disk_usage"=>6336512}
h2 = h1.with_indifferent_access
h1[:mem_usage] # => nil
h1["mem_usage"] # => 59164.0
h2[:mem_usage] # => 59164.0
h2["mem_usage"] # => 59164.0
Also, there are the symbolize_keys and stringify_keys options that may be of help. The method names are self-descriptive enough, I believe.
Clearly the keys of your hash are strings because they have double quotes around them. Therefore you will need to access the keys with instance["mem_usage"] or you will need to build a new hash with symbols as the keys first.
If you use Rails with ActiveSupport, then do use HashWithIndifferentAccess for flexibility in accessing hash with either string or symbol.
hash = HashWithIndifferentAccess.new({
"instance_id"=>"74563c459c457b2288568ec0a7779f62",
"mem_quota"=>536870912, "disk_quota"=>2147483648,
"mem_usage"=>59164.0,
"cpu_usage"=>0.1,
"disk_usage"=>6336512
})
hash[:mem_usage] # => 59164.0
hash["mem_usage"] # => 59164.0
Rudimentary irb testing suggests that Ruby Hash returns .keys and .values in matching order. Is it safe to assume that this is the case?
Yes. According to the Ruby Docs for Hash, "Hashes enumerate their values in the order that the corresponding keys were inserted." So you should always get the same order for a hash if it is created in the same way.
Depends on which Ruby version you are running.
Up to 1.8, enumeration was not insertion-ordered. Starting with 1.9, it will enumerate keys and values according to insertion order so, yes, it is safe to assume as long as you are running 1.9.
A lot of times people use symbols as keys in a Ruby hash.
What's the advantage over using a string?
E.g.:
hash[:name]
vs.
hash['name']
TL;DR:
Using symbols not only saves time when doing comparisons, but also saves memory, because they are only stored once.
Ruby Symbols are immutable (can't be changed), which makes looking something up much easier
Short(ish) answer:
Using symbols not only saves time when doing comparisons, but also saves memory, because they are only stored once.
Symbols in Ruby are basically "immutable strings" .. that means that they can not be changed, and it implies that the same symbol when referenced many times throughout your source code, is always stored as the same entity, e.g. has the same object id.
a = 'name'
a.object_id
=> 557720
b = 'name'
=> 557740
'name'.object_id
=> 1373460
'name'.object_id
=> 1373480 # !! different entity from the one above
# Ruby assumes any string can change at any point in time,
# therefore treating it as a separate entity
# versus:
:name.object_id
=> 71068
:name.object_id
=> 71068
# the symbol :name is a references to the same unique entity
Strings on the other hand are mutable, they can be changed anytime. This implies that Ruby needs to store each string you mention throughout your source code in it's separate entity, e.g. if you have a string "name" multiple times mentioned in your source code, Ruby needs to store these all in separate String objects, because they might change later on (that's the nature of a Ruby string).
If you use a string as a Hash key, Ruby needs to evaluate the string and look at it's contents (and compute a hash function on that) and compare the result against the (hashed) values of the keys which are already stored in the Hash.
If you use a symbol as a Hash key, it's implicit that it's immutable, so Ruby can basically just do a comparison of the (hash function of the) object-id against the (hashed) object-ids of keys which are already stored in the Hash. (much faster)
Downside:
Each symbol consumes a slot in the Ruby interpreter's symbol-table, which is never released.
Symbols are never garbage-collected.
So a corner-case is when you have a large number of symbols (e.g. auto-generated ones). In that case you should evaluate how this affects the size of your Ruby interpreter (e.g. Ruby can run out of memory and blow up if you generate too many symbols programmatically).
Notes:
If you do string comparisons, Ruby can compare symbols just by comparing their object ids, without having to evaluate them. That's much faster than comparing strings, which need to be evaluated.
If you access a hash, Ruby always applies a hash-function to compute a "hash-key" from whatever key you use. You can imagine something like an MD5-hash. And then Ruby compares those "hashed keys" against each other.
Every time you use a string in your code, a new instance is created - string creation is slower than referencing a symbol.
Starting with Ruby 2.1, when you use frozen strings, Ruby will use the same string object. This avoids having to create new copies of the same string, and they are stored in a space that is garbage collected.
Long answers:
https://web.archive.org/web/20180709094450/http://www.reactive.io/tips/2009/01/11/the-difference-between-ruby-symbols-and-strings
http://www.randomhacks.net.s3-website-us-east-1.amazonaws.com/2007/01/20/13-ways-of-looking-at-a-ruby-symbol/
https://www.rubyguides.com/2016/01/ruby-mutability/
The reason is efficiency, with multiple gains over a String:
Symbols are immutable, so the question "what happens if the key changes?" doesn't need to be asked.
Strings are duplicated in your code and will typically take more space in memory.
Hash lookups must compute the hash of the keys to compare them. This is O(n) for Strings and constant for Symbols.
Moreover, Ruby 1.9 introduced a simplified syntax just for hash with symbols keys (e.g. h.merge(foo: 42, bar: 6)), and Ruby 2.0 has keyword arguments that work only for symbol keys.
Notes:
1) You might be surprised to learn that Ruby treats String keys differently than any other type. Indeed:
s = "foo"
h = {}
h[s] = "bar"
s.upcase!
h.rehash # must be called whenever a key changes!
h[s] # => nil, not "bar"
h.keys
h.keys.first.upcase! # => TypeError: can't modify frozen string
For string keys only, Ruby will use a frozen copy instead of the object itself.
2) The letters "b", "a", and "r" are stored only once for all occurrences of :bar in a program. Before Ruby 2.2, it was a bad idea to constantly create new Symbols that were never reused, as they would remain in the global Symbol lookup table forever. Ruby 2.2 will garbage collect them, so no worries.
3) Actually, computing the hash for a Symbol didn't take any time in Ruby 1.8.x, as the object ID was used directly:
:bar.object_id == :bar.hash # => true in Ruby 1.8.7
In Ruby 1.9.x, this has changed as hashes change from one session to another (including those of Symbols):
:bar.hash # => some number that will be different next time Ruby 1.9 is ran
Re: what's the advantage over using a string?
Styling: its the Ruby-way
(Very) slightly faster value look ups since hashing a symbol is equivalent to hashing an integer vs hashing a string.
Disadvantage: consumes a slot in the program's symbol table that is never released.
I'd be very interested in a follow-up regarding frozen strings introduced in Ruby 2.x.
When you deal with numerous strings coming from a text input (I'm thinking of HTTP params or payload, through Rack, for example), it's way easier to use strings everywhere.
When you deal with dozens of them but they never change (if they're your business "vocabulary"), I like to think that freezing them can make a difference. I haven't done any benchmark yet, but I guess it would be close the symbols performance.
When I first started reading about and learning ruby, I read something about the power of ruby symbols over strings: symbols are stored in memory only once, while strings are stored in memory once per string, even if they are the same.
For instance: Rails' params Hash in the Controller has a bunch of keys as symbols:
params[:id] or
params[:title]...
But other decently sized projects such as Sinatra and Jekyll don't do that:
Jekyll:
post.data["title"] or
post.data["tags"]...
Sinatra:
params["id"] or
params["title"]...
This makes reading new code a little tricky, and makes it hard to transfer code around and to figure out why using symbols isn't working. There are many more examples of this and it's kind of confusing. Should we or shouldn't we be using symbols in this case? What are the advantages of symbols and should we be using them here?
In ruby, after creating the AST, each symbol is represented as a unique integer. Having symbols as hash keys makes the computing a lot faster, as the main operation is comparison.
Symbols are not garbage collected AFAIK, so that might be a thing to watch out for, but except for that they really are great as hash keys.
One reason for the usage of strings may be the usage of yaml to define the values.
require 'yaml'
data = YAML.load(<<-data
one:
title: one
tag: 1
two:
title: two
tag: 2
data
) #-> {"one"=>{"title"=>"one", "tag"=>1}, "two"=>{"title"=>"two", "tag"=>2}}
You may use yaml to define symbol-keys:
require 'yaml'
data = YAML.load(<<-data
:one:
:title: one
:tag: 1
:two:
:title: two
:tag: 2
data
) #-> {:one=>{:title=>"one", :tag=>1}, :two=>{:title=>"two", :tag=>2}}
But in the yaml-definition symbols look a bit strange, strings looks more natural.
Another reason for strings as keys: Depending on the use case, it can be reasonable to sort by keys, but you can't sort symbols (at least not without a conversion to strings).
The main difference is that multiple symbols representing a single value are identical whereas this is not true with strings. For example:
irb(main):007:0> :test.object_id
=> 83618
irb(main):008:0> :test.object_id
=> 83618
irb(main):009:0> :test.object_id
=> 83618
3 references to the symbol :test, all the same object.
irb(main):010:0> "test".object_id
=> -605770378
irb(main):011:0> "test".object_id
=> -605779298
irb(main):012:0> "test".object_id
=> -605784948
3 references to the string "test", all different objects.
This means that using symbols can potentially save a good bit of memory depending on the application. It is also faster to compare symbols for equality since they are the same object, comparing identical strings is much slower since the string values need to be compared instead of just the object ids.
I usually use strings for almost everything except things like hash keys where I really want a unique identifier, not a string
I'm writing a web app that points to external links. I'm looking to create a non-sequential, non-guessable id for each document that I can use in the URL. I did the obvious thing: treating the url as a string and str#crypt on it, but that seems to choke on any non-alphanumberic characters, like the slashes, dots and underscores.
Any suggestions on the best way to solve this problem?
Thanks!
Depending on how long a string you would like you can use a few alternatives:
require 'digest'
Digest.hexencode('http://foo-bar.com/yay/?foo=bar&a=22')
# "687474703a2f2f666f6f2d6261722e636f6d2f7961792f3f666f6f3d62617226613d3232"
require 'digest/md5'
Digest::MD5.hexdigest('http://foo-bar.com/yay/?foo=bar&a=22')
# "43facc5eb5ce09fd41a6b55dba3fe2fe"
require 'digest/sha1'
Digest::SHA1.hexdigest('http://foo-bar.com/yay/?foo=bar&a=22')
# "2aba83b05dc9c2d9db7e5d34e69787d0a5e28fc5"
require 'digest/sha2'
Digest::SHA2.hexdigest('http://foo-bar.com/yay/?foo=bar&a=22')
# "e78f3d17c1c0f8d8c4f6bd91f175287516ecf78a4027d627ebcacfca822574b2"
Note that this won't be unguessable, you may have to combine it with some other (secret but static) data to salt the string:
salt = 'foobar'
Digest::SHA1.hexdigest(salt + 'http://foo-bar.com/yay/?foo=bar&a=22')
# "dbf43aff5e808ae471aa1893c6ec992088219bbb"
Now it becomes much harder to generate this hash for someone who doesn't know the original content and has no access to your source.
I would also suggest looking at the different algorithms in the digest namespace. To make it harder to guess, rather than (or in addition to) salting with a secret passphrase, you can also use a precise dump of the time:
require 'digest/md5'
def hash_url(url)
Digest::MD5.hexdigest("#{Time.now.to_f}--#{url}")
end
Since the result of any hashing algorithm is not guaranteed to be unique, don't forget to check for the uniqueness of your result against previously generated hashes before assuming that your hash is usable. The use of Time.now makes the retry trivial to implement, since you only have to call until a unique hash is generated.
Use Digest::MD5 from Ruby's standard library:
Digest::MD5.hexdigest(my_url)