Which way is efficient for storing in hash? - ruby

Assume my_hash = {:name => "bob", :age => 21}. I can assign values to the hash in three ways:
Way 1
my_hash[:name] = "bob"
my_hash[:age] = 21
Way 2
my_hash.store(:name,"bob")
my_hash.store(:age,21)
Way 3
my_hash = {:name => "bob", :age => 21}
Please help me understand value assignment in terms of OS memory. Why are there three ways to assign values to keys, and which way is efficient?

In terms of memory, I believe all three take the same amount. I benchmarked each approach, and these are the results. As you can see, the speed of each is only marginally different, not enough to choose one over the other.
So just use whichever form feels most natural when writing your code.
user system total real
0.760000 0.030000 0.790000 ( 0.808573) my_hash[:t] = 1
0.810000 0.030000 0.840000 ( 0.842075) my_hash.store(:t, 1)
0.750000 0.020000 0.770000 ( 0.789766) my_hash = {:t => 1}
Benchmarking script:
require 'benchmark'
Benchmark.bm do |x|
x.report do
1000000.times do
my_hash = {}
my_hash[:t] = 1
my_hash[:b] = 2
end
end
x.report do
1000000.times do
my_hash = {}
my_hash.store(:t, 1)
my_hash.store(:b, 2)
end
end
x.report do
1000000.times do
my_hash = {:t => 1, :b => 2}
end
end
end

I prefer benchmark-ips for this sort of thing, because it works out how many times the test should be performed and it also gives you some error margins. For this case, the following script
require 'benchmark/ips'
Benchmark.ips do |x|
x.report('[]') do |n|
n.times do
t = {}
t[:x] = 1
end
end
x.report('store') do |n|
n.times do
t = {}
t.store(:x, 1)
end
end
end
produces
[] 2.082M (±14.6%) i/s - 10.276M
store 1.978M (±13.9%) i/s - 9.790M
i.e. the difference is well within the margin of error. This isn't surprising: if you look at the source, you can see that []= and store are exactly the same method.

Related

Is there a performance difference between `select` and `select!` when called on a Ruby hash?

hash = { 'mark' => 1, 'jane' => 1, 'peter' => 35 }.select {|k,v| v > 1}
#=> { 'peter' => 35 }
What if I have millions of keys - is there a difference between
hash = hash.select vs hash.select! ?
select! will perform better (I'll show the source for MRI, but it should be the same for the others).
The reason is that select has to create a whole new Hash object and copy into it every entry for which the block returns a truthy value.
select!, on the other hand, removes in place every entry for which the block returns a falsy value, so no new object needs to be created.
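The behavioural difference can be seen in a small sketch that reuses the hash from the question; the variable names are illustrative.

```ruby
scores = { 'mark' => 1, 'jane' => 1, 'peter' => 35 }

# select builds and returns a new Hash; the receiver is left untouched.
high = scores.select { |_name, score| score > 1 }
p high.keys                 # ["peter"]
p scores.size               # 3
p high.equal?(scores)       # false

# select! deletes the rejected entries from the receiver itself.
result = scores.select! { |_name, score| score > 1 }
p scores.keys               # ["peter"]
p result.equal?(scores)     # true

# When nothing is removed, select! returns nil, so avoid chaining on it.
unchanged = scores.select! { |_name, score| score > 1 }
p unchanged                 # nil
```

Note that `hash = hash.select { ... }` leaves the old hash behind for the garbage collector, which is where the extra cost comes from.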
You can always do a little benchmark:
require 'benchmark'
# Creates a big hash in the format: { 1 => 1, 2 => 2 ... }
big_hash = 100_000.times.inject({}) { |hash, i| hash.tap { |h| h[i] = i } }
Benchmark.bm do |bm|
bm.report('select') { big_hash.select{ |k,v| v > 50 } }
bm.report('select!') { big_hash.select!{ |k,v| v > 50 } }
end
user system total real
select 0.080000 0.000000 0.080000 ( 0.088048)
select! 0.020000 0.000000 0.020000 ( 0.021324)
Absolutely yes. select! is in-place and leads to fewer GC sweeps and less consumption of memory. As proof of concept:
This is ./wrapper.rb:
require 'json'
require 'benchmark'
def measure(&block)
no_gc = ARGV[0] == '--no-gc'
no_gc ? GC.disable : GC.start
memory_before = `ps -o rss= -p #{Process.pid}`.to_f #/ 1024
gc_stat_before = GC.stat
time = Benchmark.realtime do
yield
end
puts ObjectSpace.count_objects
if !no_gc
puts " Sweeping"
GC.start(full_mark: true, immediate_sweep: true, immediate_mark: false)
end
puts ObjectSpace.count_objects
gc_stat_after = GC.stat
memory_after = `ps -o rss= -p #{Process.pid}`.to_f # / 1024
puts({
RUBY_VERSION => {
gc: no_gc ? 'disabled': 'enabled',
time: time.round(2),
gc_count: gc_stat_after[:count] - gc_stat_before[:count],
memory: "%d MB" % (memory_after - memory_before)
}
}.to_json)
puts "---------\n"
end
This is ./so_question.rb:
require_relative './wrapper'
data = Array.new(100) { ["x","y"].sample * 1024 * 1024 }
measure do
data.select! { |x| x.start_with?("x") }
end
measure do
data = data.select { |x| x.start_with?("x") }
end
Running it:
ruby so_question.rb --no-gc
Results in:
{:TOTAL=>30160, :FREE=>21134, :T_OBJECT=>160, :T_CLASS=>557,
:T_MODULE=>38, :T_FLOAT=>7, :T_STRING=>5884, :T_REGEXP=>75,
:T_ARRAY=>710, :T_HASH=>35, :T_STRUCT=>2, :T_BIGNUM=>2, :T_FILE=>3,
:T_DATA=>896, :T_COMPLEX=>1, :T_NODE=>618, :T_ICLASS=>38}
{:TOTAL=>30160, :FREE=>21067, :T_OBJECT=>160, :T_CLASS=>557,
:T_MODULE=>38, :T_FLOAT=>7, :T_STRING=>5947, :T_REGEXP=>75,
:T_ARRAY=>710, :T_HASH=>38, :T_STRUCT=>2, :T_BIGNUM=>2, :T_FILE=>3,
:T_DATA=>897, :T_COMPLEX=>1, :T_NODE=>618, :T_ICLASS=>38}
{"2.2.2":{"gc":"disabled","time":0.0,"gc_count":0,"memory":"20 MB"}}
{:TOTAL=>30160, :FREE=>20922, :T_OBJECT=>162, :T_CLASS=>557, :T_MODULE=>38, :T_FLOAT=>7, :T_STRING=>6072, :T_REGEXP=>75,
:T_ARRAY=>717, :T_HASH=>45, :T_STRUCT=>2, :T_BIGNUM=>2, :T_FILE=>3,
:T_DATA=>901, :T_COMPLEX=>1, :T_NODE=>618, :T_ICLASS=>38}
{:TOTAL=>30160, :FREE=>20885, :T_OBJECT=>162, :T_CLASS=>557,
:T_MODULE=>38, :T_FLOAT=>7, :T_STRING=>6108, :T_REGEXP=>75,
:T_ARRAY=>717, :T_HASH=>46, :T_STRUCT=>2, :T_BIGNUM=>2, :T_FILE=>3,
:T_DATA=>901, :T_COMPLEX=>1, :T_NODE=>618, :T_ICLASS=>38}
{"2.2.2":{"gc":"disabled","time":0.0,"gc_count":0,"memory":"0 MB"}}
Note the memory difference. Also, I made this example with Array instead of Hash, but both behave the same way here: select returns a new collection, while select! modifies the receiver in place.

Ruby - Initialize hash key-value in a loop

I have a hash of key value pairs, similar to -
myhash={'test1' => 'test1', 'test2' => 'test2', ...}
how can I initialize such a hash in a loop? Basically I need it to go from 1..50 with the same test$i values but I cannot figure out how to initialize it properly in a loop instead of doing it manually.
I know how to loop through each key-value pair individually:
myhash.each_pair do |key, value|
but that doesn't help with init
How about:
hash = (1..50).each.with_object({}) do |i, h|
h["test#{i}"] = "test#{i}"
end
If you want to do this lazily, you could do something like below:
hash = Hash.new { |hash, key| key =~ /^test\d+/ ? hash[key] = key : nil}
p hash["test10"]
#=> "test10"
p hash
#=> {"test10"=>"test10"}
The block passed to Hash constructor will be invoked whenever a key is not found in hash, we check whether key follows a certain pattern (based on your need), and create a key-value pair in hash where value is equal to key passed.
(1..50).map { |i| ["test#{i}"] * 2 }.to_h
The solution above is more DRY than two other answers, since "test" is not repeated twice :)
It is also approximately 10% faster, by the way (this would not be the case if keys and values differed):
require 'benchmark'
n = 500000
Benchmark.bm do |x|
x.report { n.times do ; (1..50).map { |i| ["test#{i}"] * 2 }.to_h ; end }
x.report { n.times do ; (1..50).each.with_object({}) do |i, h| ; h["test#{i}"] = "test#{i}" ; end ; end }
end
user system total real
17.630000 0.000000 17.630000 ( 17.631221)
19.380000 0.000000 19.380000 ( 19.372783)
Or one might use eval:
hash = {}
(1..50).map { |i| eval "hash['test#{i}'] = 'test#{i}'" }
or even JSON.parse:
require 'json'
JSON.parse("{" << (1..50).map { |i| %Q|"test#{i}": "test#{i}"| }.join(',') << "}")
First of all, there's Array#to_h, which converts an array of key-value pairs into a hash.
Second, you can just initialize such a hash in a loop, just do something like this:
target = {}
1.upto(50) do |i|
target["test_#{i}"] = "test_#{i}"
end
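The Array#to_h approach mentioned first can be sketched as follows. The keys use the "test#{i}" form from the question; the block form of to_h assumes Ruby 2.6 or newer.

```ruby
# Build an array of [key, value] pairs, then convert it with Array#to_h.
pairs = (1..50).map { |i| ["test#{i}", "test#{i}"] }
hash  = pairs.to_h

puts hash.size       # 50
puts hash["test7"]   # test7

# On Ruby 2.6 or newer, to_h also accepts a block that returns each pair,
# which skips the intermediate array of pairs.
hash_from_block = (1..50).to_h { |i| ["test#{i}", "test#{i}"] }
puts hash == hash_from_block   # true
```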
You can also do this:
hash = Hash.new{|h, k| h[k] = k.itself}
(1..50).each{|i| hash["test#{i}"]}
hash # => ...

Array#delete_at or Array#slice!? and how to look up implementations

I'm scrubbing large data files (+1MM comma-separated rows). An example row might look like this:
@row = "123456789,11122,CustomerName,2014-01-31,2014-02-01,RemoveThisEntry,R,SKUInfo,05-MAR-14 05:50:24,SourceID,RemoveThisEntryToo,TransactionalID"
Certain columns must be removed from it, after which the row should look like this:
@row = "123456789,11122,CustomerName,2014-01-31,2014-02-01,R,SKUInfo,05-MAR-14 05:50:24,SourceID,TransactionalID"
QUESTION 1: If I convert a row of data into an Array, which method is preferred for removing elements: Array#delete_at or Array#slice!? I'd like to know which is the more idiomatic option. Performance is a consideration here, and I'm on a Windows machine.
def remove_bad_columns
ary = @row.split(",")
ary.delete_at(10)
ary.delete_at(5)
@row = ary.join(",")
end
QUESTION 2: I was wondering if one of these methods was implemented using the other. How can I see how the methods are built in ruby? (How for is implemented using each, for example.)
I suggest you use Array#values_at rather than delete_at or slice!:
def remove_vals(str, *indices)
ary = str.split(",")
v = (0...ary.size).to_a - indices
ary.values_at(*v).join(",")
end
@row = "123456789,11122,CustomerName,2014-01-31,2014-02-01,RemoveThisEntry," +
"R,SKUInfo,05-MAR-14 05:50:24,SourceID,RemoveThisEntryToo,TransactionalID"
@row = remove_vals(@row, 5, 10)
#=> "123456789,11122,CustomerName,2014-01-31,2014-02-01,R,SKUInfo," +
# "05-MAR-14 05:50:24,SourceID,TransactionalID"
Array#values_at has the advantage over the other two methods that you don't have to worry about the order in which the elements are removed.
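To illustrate the ordering point, here is a minimal sketch using a made-up array of twelve letters in place of the real row, with the goal of dropping indices 5 and 10.

```ruby
letters = %w[a b c d e f g h i j k l]   # goal: drop index 5 ("f") and index 10 ("k")

# Deleting the lower index first shifts every later element left by one,
# so the second delete_at removes "l" instead of "k".
wrong = letters.dup
wrong.delete_at(5)
wrong.delete_at(10)
puts wrong.join    # abcdeghijk

# Deleting from the highest index down avoids the shift.
right = letters.dup
[10, 5].each { |i| right.delete_at(i) }
puts right.join    # abcdeghijl

# values_at works from the original indices, so the order does not matter.
keep = (0...letters.size).to_a - [5, 10]
via_values_at = letters.values_at(*keep)
puts via_values_at.join   # abcdeghijl
```

This is why the delete_at version in the question removes index 10 before index 5.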
The efficiency of this method is not significantly different from that of the other two. If @spickermann would like to add it to his benchmarks, he could use this:
def values_at
ary = array.split(",")
v = (0...ary.size).to_a - [5,10]
@row = ary.values_at(*v).join(",")
end
There is not really a difference in performance. I would prefer delete_at because that reads nicer.
require 'benchmark'
def array
"123456789,11122,CustomerName,2014-01-31,2014-02-01,RemoveThisEntry,R,SKUInfo,05-MAR-14 05:50:24,SourceID,RemoveThisEntryToo,TransactionalID"
end
def delete_at
ary = array.dup.split(",")
ary.delete_at(10)
ary.delete_at(5)
@row = ary.join(",")
end
def slice!
ary = array.dup.split(",")
ary.slice!(10)
ary.slice!(5)
@row = ary.join(",")
end
n = 1_000_000
Benchmark.bmbm(15) do |x|
x.report("delete_at :") { n.times do; delete_at; end }
x.report("slice! :") { n.times do; slice! ; end }
end
# Rehearsal ---------------------------------------------------
# delete_at : 4.560000 0.000000 4.560000 ( 4.566496)
# slice! : 4.580000 0.010000 4.590000 ( 4.576767)
# ------------------------------------------ total: 9.150000sec
#
# user system total real
# delete_at : 4.500000 0.000000 4.500000 ( 4.505638)
# slice! : 4.600000 0.000000 4.600000 ( 4.613447)
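On Question 2 (how to look up implementations), one starting point is to ask Ruby itself where a method lives. This is a minimal sketch, assuming MRI; for methods implemented in C there is no Ruby source file to point at, so the next step would be reading array.c in the Ruby source tree or the source toggle in the official documentation.

```ruby
# Ask Ruby where each method is defined.
%i[delete_at slice!].each do |name|
  meth = Array.instance_method(name)
  # owner is the class or module that defines the method.
  # source_location is [file, line] for methods written in Ruby,
  # and nil for methods implemented in C (as much of Array is in MRI).
  puts "#{name}: owner=#{meth.owner}, source_location=#{meth.source_location.inspect}"
end
```

In the MRI sources I am aware of, slice! called with a single integer index ends up in the same C routine as delete_at, which would fit the near-identical timings above, but treat that as something to confirm in array.c for your Ruby version.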

Slicing params hash for specific values

Summary
Given a Hash, what is the most efficient way to create a subset Hash based on a list of keys to use?
h1 = { a:1, b:2, c:3 } # Given a hash...
p foo( h1, :a, :c, :d ) # ...create a method that...
#=> { :a=>1, :c=>3, :d=>nil } # ...returns specified keys...
#=> { :a=>1, :c=>3 } # ...or perhaps only keys that exist
Details
The Sequel database toolkit allows one to create or update a model instance by passing in a Hash:
foo = Product.create( hash_of_column_values )
foo.update( another_hash )
The Sinatra web framework makes available a Hash named params that includes form variables, querystring parameters and also route matches.
If I create a form holding only fields named the same as the database columns and post it to this route, everything works very conveniently:
post "/create_product" do
new_product = Product.create params
redirect "/product/#{new_product.id}"
end
However, this is both fragile and dangerous. It's dangerous because a malicious hacker could post a form with columns not intended to be changed and have them updated. It's fragile because using the same form with this route will not work:
post "/update_product/:foo" do |prod_id|
if product = Product[prod_id]
product.update(params)
#=> <Sequel::Error: method foo= doesn't exist or access is restricted to it>
end
end
So, for robustness and security I want to be able to write this:
post "/update_product/:foo" do |prod_id|
if product = Product[prod_id]
# Only update two specific fields
product.update(params.slice(:name,:description))
# The above assumes a Hash (or Sinatra params) monkeypatch
# I will also accept standalone helper methods that perform the same
end
end
...instead of the more verbose and non-DRY option:
post "/update_product/:foo" do |prod_id|
if product = Product[prod_id]
# Only update two specific fields
product.update({
name:params[:name],
description:params[:description]
})
end
end
Update: Benchmarks
Here are the results of benchmarking the (current) implementations:
user system total real
sawa2 0.250000 0.000000 0.250000 ( 0.269027)
phrogz2 0.280000 0.000000 0.280000 ( 0.275027)
sawa1 0.297000 0.000000 0.297000 ( 0.293029)
phrogz3 0.296000 0.000000 0.296000 ( 0.307031)
phrogz1 0.328000 0.000000 0.328000 ( 0.319032)
activesupport 0.639000 0.000000 0.639000 ( 0.657066)
mladen 1.716000 0.000000 1.716000 ( 1.725172)
The second answer by @sawa is the fastest of all, a hair in front of my tap-based implementation (based on his first answer). Choosing to add the check for has_key? adds very little time, and is still more than twice as fast as ActiveSupport.
Here is the benchmark code:
h1 = Hash[ ('a'..'z').zip(1..26) ]
keys = %w[a z c d g A x]
n = 60000
require 'benchmark'
Benchmark.bmbm do |x|
%w[ sawa2 phrogz2 sawa1 phrogz3 phrogz1 activesupport mladen ].each do |m|
x.report(m){ n.times{ h1.send(m,*keys) } }
end
end
I would just use the slice method provided by active_support
require 'active_support/core_ext/hash/slice'
{a: 1, b: 2, c: 3}.slice(:a, :c) # => {a: 1, c: 3}
Of course, make sure to update your gemfile:
gem 'activesupport'
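On Ruby 2.5 and later, Hash#slice is part of core Ruby, so no gem or monkeypatch is needed for the basic case. A minimal sketch using the hash from the question:

```ruby
# Hash#slice is built in from Ruby 2.5 onward; no require is needed.
h1 = { a: 1, b: 2, c: 3 }

subset = h1.slice(:a, :c, :d)
p subset.keys       # [:a, :c]  (the missing :d is skipped, not set to nil)
p subset.key?(:d)   # false
p h1.size           # 3, the original hash is unchanged
```

This matches the "only keys that exist" variant from the question rather than the variant that fills in nil.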
I changed my mind. The previous one doesn't seem to be any good.
class Hash
def slice1(*keys)
keys.each_with_object({}){|k, h| h[k] = self[k]}
end
def slice2(*keys)
h = {}
keys.each{|k| h[k] = self[k]}
h
end
end
Sequel has built-in support for only picking specific columns when updating:
product.update_fields(params, [:name, :description])
That doesn't do exactly the same thing if :name or :description is not present in params, though. But assuming you are expecting the user to use your form, that shouldn't be an issue.
I could always expand update_fields to take an option hash with an option that will skip the value if not present in the hash. I just haven't received a request to do that yet.
Perhaps
class Hash
def slice *keys
select{|k| keys.member?(k)}
end
end
Or you could just copy ActiveSupport's Hash#slice, it looks a bit more robust.
Here are my implementations; I will benchmark and accept faster (or sufficiently more elegant) solutions:
# Implementation 1
class Hash
def slice(*keys)
Hash[keys.zip(values_at(*keys))]
end
end
# Implementation 2
class Hash
def slice(*keys)
{}.tap{ |h| keys.each{ |k| h[k]=self[k] } }
end
end
# Implementation 3 - silently ignore keys not in the original
class Hash
def slice(*keys)
{}.tap{ |h| keys.each{ |k| h[k]=self[k] if has_key?(k) } }
end
end
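The practical difference between Implementations 2 and 3 shows up when a requested key is missing. The sketch below uses standalone helper methods instead of reopening Hash; the helper names are hypothetical and chosen only for this illustration.

```ruby
# Hypothetical standalone helpers (names are illustrative, not from any library).

# Mirrors Implementation 2: every requested key appears; missing ones map to nil.
def slice_with_nils(hash, *keys)
  keys.each_with_object({}) { |k, out| out[k] = hash[k] }
end

# Mirrors Implementation 3: keys absent from the source hash are skipped.
def slice_existing(hash, *keys)
  keys.each_with_object({}) { |k, out| out[k] = hash[k] if hash.key?(k) }
end

h1 = { a: 1, b: 2, c: 3 }
with_nils = slice_with_nils(h1, :a, :c, :d)
existing  = slice_existing(h1, :a, :c, :d)

p with_nils.keys   # [:a, :c, :d]
p with_nils[:d]    # nil
p existing.keys    # [:a, :c]
```

When the result is passed to an update call, the nil-filled variant may overwrite a field that the form simply did not submit, so the skipping variant is usually the safer default for that use.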

What's the fastest way to extract an array of nested objects from an array of objects in Ruby?

I have an array of Elements, and each element has a property :image.
I would like an array of :images, so what's the quickest and least expensive way to achieve this? Is it just a matter of iterating over the array and pushing each element into a new array, something like this:
images = []
elements.each {|element| images << element.image}
elements.map {|element| element.image}
This should have about the same performance as your version, but is somewhat more succinct and more idiomatic.
You can use the Benchmark module to test these sorts of things. I ran @sepp2k's version against your original code like so:
require 'benchmark'
class Element
attr_accessor :image
def initialize(image)
@image = image
end
end
elements = Array.new(500) {|index| Element.new(index)}
n = 10000
Benchmark.bm do |x|
x.report do
n.times do
# Globalkeith's version
image = []
elements.each {|element| image << element.image}
end
end
# sepp2k's version
x.report { n.times do elements.map {|element| element.image} end }
end
The output on my machine was consistently (after more than 3 runs) very close to this:
user system total real
2.140000 0.000000 2.140000 ( 2.143290)
1.420000 0.010000 1.430000 ( 1.422651)
Thus demonstrating that map is significantly faster than manually appending to an array when the array is somewhat large and the operation is performed many times.
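For completeness, map is often written with the symbol-to-proc shorthand. The sketch below uses a Struct as a hypothetical stand-in for the Element class, with made-up image names.

```ruby
# Hypothetical stand-in for the Element class in the question.
Element = Struct.new(:image)
elements = Array.new(5) { |index| Element.new("img#{index}.png") }

# Explicit block form, as in the answer above.
images = elements.map { |element| element.image }

# Symbol-to-proc shorthand; it returns the same array.
short = elements.map(&:image)

p images               # ["img0.png", "img1.png", "img2.png", "img3.png", "img4.png"]
puts images == short   # true
```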

Resources