Which of these Array Initializations is better in Ruby? - ruby

Which of these two forms of Array Initialization is better in Ruby?
Method 1:
DAYS_IN_A_WEEK = (0..6).to_a
HOURS_IN_A_DAY = (0..23).to_a
@data = Array.new(DAYS_IN_A_WEEK.size).map! { Array.new(HOURS_IN_A_DAY.size) }
DAYS_IN_A_WEEK.each do |day|
  HOURS_IN_A_DAY.each do |hour|
    @data[day][hour] = 'something'
  end
end
Method 2:
DAYS_IN_A_WEEK = (0..6).to_a
HOURS_IN_A_DAY = (0..23).to_a
@data = {}
DAYS_IN_A_WEEK.each do |day|
  HOURS_IN_A_DAY.each do |hour|
    @data[day] ||= {}
    @data[day][hour] = 'something'
  end
end
The difference between the two methods is that the second one does not allocate memory up front. I suspect the second one performs worse because of the number of Array copies that have to happen as the structure grows.
However, it is not straightforward in Ruby to find out what is actually happening. So if someone can explain to me which is better, that would be great!
Thanks

Before I answer the question you asked, I'm going to answer the question you should have asked but didn't:
Q: Should I focus on making my code readable first, or should I focus on performance first?
A: Make your code readable and correct first, then, and only if there is a performance problem, start to worry about performance by measuring where the performance problem is first and only then making changes to your code.
Now to answer the question you asked, but shouldn't have:
method1.rb:
DAYS_IN_A_WEEK = (0..6).to_a
HOURS_IN_A_DAY = (0..23).to_a
10000.times do
  @data = Array.new(DAYS_IN_A_WEEK.size).map! { Array.new(HOURS_IN_A_DAY.size) }
  DAYS_IN_A_WEEK.each do |day|
    HOURS_IN_A_DAY.each do |hour|
      @data[day][hour] = 'something'
    end
  end
end
method2.rb:
DAYS_IN_A_WEEK = (0..6).to_a
HOURS_IN_A_DAY = (0..23).to_a
10000.times do
  @data = {}
  DAYS_IN_A_WEEK.each do |day|
    HOURS_IN_A_DAY.each do |hour|
      @data[day] ||= {}
      @data[day][hour] = 'something'
    end
  end
end
Results of brain-dead benchmark:
$ time ruby method1.rb
real 0m1.189s
user 0m1.140s
sys 0m0.000s
$ time ruby method2.rb
real 0m1.879s
user 0m1.780s
sys 0m0.020s
Looks to me like user time usage (the important factor) has method1.rb a lot faster. You, of course, should not trust this benchmark and should make your own reflecting your actual code use. This, however, is something you should do only after you have determined which code is your performance bottleneck in reality. (Hint: 99.44% of computer programmers are 100% wrong when they guess where their bottlenecks are without measuring!)

What's wrong with just
@data = Array.new(7) { Array.new(24) { 'something' } }
Or, if you are content having the same object everywhere:
@data = Array.new(7) { Array.new(24, 'something') }
It's much faster, not that it would matter. It is also much more readable, which is the most important thing. After all, the purpose of code is communicating intent to the other stakeholders, not communicating with the computer.
              user     system      total        real
method1   8.969000   0.000000   8.969000 (  9.059570)
method2  16.547000   0.000000  16.547000 ( 16.799805)
method3   6.468000   0.000000   6.468000 (  6.616211)
method4   0.969000   0.015000   0.984000 (  1.021484)
That last line also shows another interesting thing: the runtime is dominated by the time needed to create the 7*24*100000 = 16.8 million 'something' strings.
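To make the "same object everywhere" caveat concrete, here is a minimal sketch with a deliberately tiny 2x3 grid. The variable names separate and shared are made up for illustration, and .dup is only there so the strings can be mutated even when string literals are frozen.

```ruby
# Block form: the inner block runs once per slot, so each slot holds its own String.
separate = Array.new(2) { Array.new(3) { 'something'.dup } }

# Default-value form: one String per row is stored in every slot of that row.
shared = Array.new(2) { Array.new(3, 'something'.dup) }

# Mutate the first cell of the first row in place.
separate[0][0] << '!'
shared[0][0] << '!'

p separate[0]  # only the first element was changed
p shared[0]    # every element of the row was changed, because they are one object
p shared[1]    # the second row is untouched: the outer block ran again for it
```

If the cells are only ever reassigned (never mutated in place), the sharing is harmless, which is why the second form is fine for immutable values such as numbers or nil.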
And of course there is another important observation: your method1 and method2 that you are comparing against each other do two completely different things! It doesn't even make sense to compare them against each other. method1 creates an Array, method2 creates a Hash.
Your method1 is equivalent to my first example above:
@data = Array.new(7) { Array.new(24) { 'something' } }
While method2 is (very roughly) equivalent to:
@data = Hash.new { |h, k| h[k] = Hash.new { |h, k| h[k] = 'something' } }
Well, except that your method2 initializes the entire Hash eagerly, while my method only executes the initialization code lazily in case an uninitialized key is read.
In other words, after running the above initialization code, the Hash is still empty:
@data # => {}
But whenever you try to access a key, it will magically appear:
@data[5][17] # => 'something'
And it will stay there:
@data # => {5 => {17 => 'something'}}
Since this code doesn't actually initialize the Hash, it is obviously way faster:
              user     system      total        real
method5   0.266000   0.000000   0.266000 (  0.296875)

I wrapped both of the code snippets into separate methods and did some benchmarking. Here are the results:
require 'benchmark'
Benchmark.bm(7) do |x|
  x.report("method1") { 100000.times { method1 } }
  x.report("method2") { 100000.times { method2 } }
end
user system total real
method1 11.370000 0.010000 11.380000 ( 11.392233)
method2 17.920000 0.010000 17.930000 ( 18.328318)

Related

What is an idiomatic way to measure time in Ruby?

This is pretty ugly:
t = Time.now
result = do_something
elapsed = Time.now - t
I tried this:
elapsed = time do
result = do_something
end
def time
t = Time.now
yield
Time.now - t
end
This is better. But the problem is that result falls out of scope after the block ends.
So, is there a better way of doing timing? Or a good way to use the result?
A really idiomatic way would be to use the standard library. :)
require 'benchmark'
result = nil
elapsed = Benchmark.realtime do
result = do_something
end
You've got the right idea here, but to avoid the scope problem do this:
result = nil
elapsed = time do
result = do_something
end
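Another way around the scope problem is to have the helper hand back both values. This is only a sketch: the name timed is made up here, and it uses the monotonic clock from Process.clock_gettime instead of Time.now so that system clock adjustments don't distort the measurement.

```ruby
# Hypothetical helper (not from the question): returns [block_result, elapsed_seconds].
def timed
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  result = yield
  [result, Process.clock_gettime(Process::CLOCK_MONOTONIC) - start]
end

# Destructure the pair, so the result never has to be declared up front.
result, elapsed = timed { (1..1_000).reduce(:+) }
puts "result=#{result} elapsed=#{elapsed.round(6)}s"
```

The trade-off is that every caller has to unpack two values, so this fits best when you always want both.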
I like the way you've constructed your time method. I have no suggestions for improvement, but I will say a few words about a related problem. Suppose you wished to measure the amount of time spent executing methods. Sometimes you might be able to write something simple such as:
require 'time'
t = Time.now
rv = my_method(*args)
et = Time.now - t
Other times that's not convenient. Suppose, for example, you were constructing an array whose elements were the return values of my_method or my_method returned an enumerator so that it could be chained to other methods.
As an example, let's suppose you wanted to sum the values of an array until a zero is encountered. One way to do that is to construct an enumerator stop_at_zero that generates values from its receiver until it encounters a zero, then stops (i.e., raises a StopIteration exception). We could then write:
arr.stop_at_zero.reduce(:+)
If we want to know how much time is spent executing stop_at_zero we could construct it as follows.
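class Array
  def stop_at_zero
    extime = Time.now
    Enumerator.new do |y|
      begin
        each do |n|
          sleep(0.5)
          # break (not return) ends the iteration; a return here would try to
          # return from stop_at_zero, which has already finished by this point
          break if n.zero?
          y << n
        end
      ensure
        $timings << [__method__, Time.now - extime]
      end
    end
  end
end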
I used a begin, ensure, end block to make sure $timings << [__method__, Time.now - extime] is executed when the method returns prematurely. sleep(0.5) is of course just for illustrative purposes.
Let's try it.
$timings = []
arr = [1,7,0,3,4]
arr.stop_at_zero.reduce(:+)
#=> 8
$timings
#=> [[:stop_at_zero, 1.505672]]
$timings will contain a history of execution times of all methods that contain the timing code.

Dynamic methods using define_method and eval

I've put together two sample classes, implemented in two different ways, which mirror pretty well what I want to do in my Rails model. My concern is that I don't know what the risks of either approach are, if any. I've only found posts that explain how to implement them, or that give a general warning to avoid them or be careful with them. What I have not found is a clear explanation of how to do this safely, what exactly I should be careful of, or why I should avoid this pattern.
class X
  attr_accessor :yn_sc, :um_sc
  def initialize
    @yn_sc = 0
    @um_sc = 0
  end
  types = %w(yn um)
  types.each do |t|
    define_method("#{t}_add") do |val|
      val = ActiveRecord::Base.send(:sanitize_sql_array, ["%s", val])
      eval("@#{t}_sc += #{val}")
    end
  end
end
class X
  attr_accessor :yn_sc, :um_sc
  def initialize
    @yn_sc = 0
    @um_sc = 0
  end
  types = %w(yn um)
  types.each do |t|
    # eval <<-EVAL also works
    self.class_eval <<-EVAL
      def #{t}_add(val)
        @#{t}_sc += val
      end
    EVAL
  end
end
x = X.new
x.yn_add(1) #=> x.yn_sc == 1 for both
Well, your code looks really safe. But imagine code that is driven by user input. It might look something like this:
puts 'Give me an order, sir!'
order = gets.chomp
eval(order)
What will happen if our captain goes wild and gives an order like system('rm -rf ~/')? Sad things, for sure!
The lesson: eval is not safe, because it executes whatever string it receives.
There is another reason not to use eval: it is sometimes slower than the alternatives. Look here if interested.
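For what it's worth, the same dynamic methods can be written without eval at all. The following is a sketch, not the asker's actual model: the class name Score is made up, and using Integer() to reject non-numeric input is an assumption about what kind of values are expected.

```ruby
class Score
  attr_accessor :yn_sc, :um_sc

  def initialize
    @yn_sc = 0
    @um_sc = 0
  end

  %w[yn um].each do |t|
    ivar = "@#{t}_sc"
    define_method("#{t}_add") do |val|
      # Integer() raises ArgumentError on strings that are not plain integers,
      # so arbitrary input is never executed or interpolated into code.
      instance_variable_set(ivar, instance_variable_get(ivar) + Integer(val))
    end
  end
end

x = Score.new
x.yn_add(1)
x.yn_add('2')
puts x.yn_sc
```

Because no string is ever evaluated as code, the only thing the caller controls is a number, and the names of the generated methods come from a fixed list in the class body.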

Why is this array building method so slow?

This method is taking over 7 seconds with 50 markets and 2,500 flows (~250,000 iterations). Why so slow?
def matrix
[:origin, :destination].collect do |location|
markets.collect do |market|
network.flows.collect { |flow| flow[location] == market ? 1 : 0 }
end
end.flatten
end
I know that the slowness comes from the comparison of one market to another market based on benchmarks that I've run.
Here are the relevant parts of the class that's being compared.
module FreightFlow
class Market
include ActiveAttr::Model
attribute :coordinates
def ==(value)
coordinates == value.coordinates
end
end
end
What's the best way to make this faster?
You are constructing 100 intermediate collections (2*50) consisting of a total of 250,000 (2*50*2500) elements, and then flattening them at the end. I would try constructing the whole data structure in one pass. Make sure that markets and network.flows are stored in a hash or set. Maybe something like:
def matrix
network.flows.collect do |flow|
    (markets.has_key?(flow[:origin]) ||
     markets.has_key?(flow[:destination])) ? 1 : 0
end
end
This is a simple thing but it can help...
In your innermost loop you're doing:
network.flows.collect { |flow| flow[location] == market ? 1 : 0 }
Instead of using the ternary statement to convert to 1 or 0, use true and false Booleans instead:
network.flows.collect { |flow| flow[location] == market }
This isn't a big difference in speed, but over the course of that many nested loops it adds up.
In addition, it allows you to simplify your tests using the matrix being generated. Instead of having to compare to 1 or 0, you can simplify your conditional tests to if flow[location], if !flow[location] or unless flow[location], again speeding up your application a little bit for each test. If those are deeply nested in loops, which is very likely, that little bit can add up again.
Something that is important to do, when speed is important, is use Ruby's Benchmark class to test various ways of doing the same task. Then, instead of guessing, you KNOW what works. You'll find lots of questions on Stack Overflow where I've supplied an answer that consists of a benchmark showing the speed differences between various ways of doing something. Sometimes the differences are very big. For instance:
require 'benchmark'
puts `ruby -v`
def test1()
true
end
def test2(p1)
true
end
def test3(p1, p2)
true
end
N = 10_000_000
Benchmark.bm(5) do |b|
b.report('?:') { N.times { (1 == 1) ? 1 : 0 } }
b.report('==') { N.times { (1 == 1) } }
b.report('if') {
N.times {
if (1 == 1)
1
else
0
end
}
}
end
Benchmark.bm(5) do |b|
b.report('test1') { N.times { test1() } }
b.report('test2') { N.times { test2('foo') } }
b.report('test3') { N.times { test3('foo', 'bar') } }
b.report('test4') { N.times { true } }
end
And the results:
ruby 1.9.3p392 (2013-02-22 revision 39386) [x86_64-darwin10.8.0]
user system total real
?: 1.880000 0.000000 1.880000 ( 1.878676)
== 1.780000 0.000000 1.780000 ( 1.785718)
if 1.920000 0.000000 1.920000 ( 1.914225)
user system total real
test1 2.760000 0.000000 2.760000 ( 2.760861)
test2 4.800000 0.000000 4.800000 ( 4.808184)
test3 6.920000 0.000000 6.920000 ( 6.915318)
test4 1.640000 0.000000 1.640000 ( 1.637506)
ruby 2.0.0p0 (2013-02-24 revision 39474) [x86_64-darwin10.8.0]
user system total real
?: 2.280000 0.000000 2.280000 ( 2.285408)
== 2.090000 0.010000 2.100000 ( 2.087504)
if 2.350000 0.000000 2.350000 ( 2.363972)
user system total real
test1 2.900000 0.010000 2.910000 ( 2.899922)
test2 7.070000 0.010000 7.080000 ( 7.092513)
test3 11.010000 0.030000 11.040000 ( 11.033432)
test4 1.660000 0.000000 1.660000 ( 1.667247)
There are two different sets of tests. The first is looking to see what the differences are with simple conditional tests vs. using == without a ternary to get just the Booleans. The second is to test the effect of calling a method, a method with a single parameter, and with two parameters, vs. "inline-code" to find out the cost of the setup and tear-down when calling a method.
Modern C compilers do some amazing things when they analyze the code prior to emitting the assembly language to be compiled. We can fine-tune them to write for size or speed. When we go for speed, the program grows as the compiler looks for loops it can unroll and places it can "inline" code, to avoid making the CPU jump around and throwing away stuff that's in the cache.
Ruby is much higher up the language chain, but some of the same ideas still apply. We can write in a very DRY manner, and avoid repetition and use methods and classes to abstract our data, but the cost is reduced processing speed. The answer is to write your code intelligently and don't waste CPU time and unroll/inline where necessary to gain speed and other times be DRY to make your code more maintainable.
It's all a balancing act, and there's a time for writing both ways.
Memoizing the indexes of the markets within the flows was way faster than any other solution. Time reduced from ~30 seconds when the question was asked to 0.6 seconds.
First, I added a flow_index in the Network class. It stores the indexes of the flows that contain the markets.
def flow_index
  @flow_index ||= begin
    flow_index = {}
    [:origin, :destination].each do |location|
      flow_index[location] = {}
      flows.each { |flow| flow_index[location][flow[location]] = [] }
      flows.each_with_index { |flow, i| flow_index[location][flow[location]] << i }
    end
    flow_index
  end
end
Then, I refactored the matrix method to use the flow index.
def matrix
base_row = network.flows.count.times.collect { 0 }
[:origin, :destination].collect do |location|
markets.collect do |market|
row = base_row.dup
network.flow_index[location][market].each do |i|
row[i] = 1
end
row
end
end.flatten
end
The base_row is created with all 0s and you just replace with 1s at the locations from the flow_index for that market.

Measure user time or system time in Ruby without Benchmark or time

Since I'm doing some time measurements at the moment, I wondered if it is possible to measure the user time or system time without using the Benchmark class or the command line utility time.
Using the Time class only reveals the wall clock time, not system and user time, however I'm looking for a solution which has the same flexibility, e.g.
time = TimeUtility.now
# some code
user, system, real = TimeUtility.now - time
The reason is that I somehow dislike Benchmark, since it cannot return numbers only (EDIT: I was wrong - it can. See answers below.). Sure, I could parse the output, but that doesn't feels right. The time utility from *NIX systems should solve my problem as well, but I wanted to know if there already is some kind of wrapper implemented in Ruby so I don't need to make these system calls by myself.
Thanks a lot!
I re-read the Benchmark documentation and saw that it has a method named measure. This method does exactly what I want: it measures the time your code needs and returns an object which contains the user time, system time, system time of child processes, etc. It is as easy as
require 'benchmark'
measurement = Benchmark.measure do
# your code goes here
end
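To get at the plain numbers, the object returned by Benchmark.measure (a Benchmark::Tms) can be read field by field. A small sketch, where the workload inside the block is just a placeholder:

```ruby
require 'benchmark'

# Placeholder workload; substitute the code you actually want to measure.
tms = Benchmark.measure { 100_000.times { |i| i.to_s } }

# Each figure is available as a Float, so no output parsing is needed.
user, system, real = tms.utime, tms.stime, tms.real
puts format('user=%.4f system=%.4f real=%.4f', user, system, real)
```

The child-process figures are available the same way through cutime and cstime.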
In the process I found out that you can add custom rows to the Benchmark output. You can use this to get the best of both worlds (custom time measurements and a nice output at the end) as follows:
require 'benchmark'
measurements = []
10.times { measurements << Benchmark.measure { 1_000_000.times { a = "1" } } }
# measurements.sum or measurements.inject(0){...} does not work, since the
# array contains Benchmark instances, which cannot be coerced into Fixnum's
# Array#sum will work if you are using Rails
sum = measurements.inject(:+)
avg = sum / measurements.size
# 7 is the width reserved for the description "sum:" and "avg:"
Benchmark.bm(7, "sum:", "avg:") do |b|
[sum, avg]
end
The result will look like the following:
user system total real
sum: 2.700000 0.000000 2.700000 ( 2.706234)
avg: 0.270000 0.000000 0.270000 ( 0.270623)
You could use the Process::times function, which returns user time/system time. (It does not report wall clock time, you'll need something else for that). Seems to be a bit version or OS dependent though.
This is what it reports on my system (linux, ruby 1.8.7):
$ irb
irb(main):001:0> t = Process.times
=> #<struct Struct::Tms utime=0.01, stime=0.0, cutime=0.0, cstime=0.0>
The docs show this though, so some versions/implementations might only have the first two:
t = Process.times
[ t.utime, t.stime ] #=> [0.0, 0.02]
See times(2) for the underlying call on Linux.
Here's a really crappy wrapper that kind of supports -:
class SysTimes
  attr_accessor :user, :system
  def initialize
    times = Process.times
    @user = times.utime
    @system = times.stime
  end
  def -(other)
    diff = SysTimes.new
    diff.user = @user - other.user
    diff.system = @system - other.system
    diff
  end
end
Should give you ideas to make it work nicely in your context.
This gem might help:
https://github.com/igorkasyanchuk/benchmark_methods
No more code like this:
t = Time.now
user.calculate_report
puts Time.now - t
Now you can do:
benchmark :calculate_report # in class
And just call your method
user.calculate_report

Ruby: Comparing two Arrays of Hashes

I'm definitely a newbie to Ruby (and using 1.9.1), so any help is appreciated. Everything I've learned about Ruby has been from using Google. I'm trying to compare two arrays of hashes and, due to their sizes, it's taking way too long and flirts with running out of memory.
I have a Class (ParseCSV) with multiple methods (initialize, open, compare, strip, output).
The way I have it working right now is as follows (and this does pass the tests I've written, just using a much smaller data set):
file1 = ParseCSV.new("some_file")
file2 = ParseCSV.new("some_other_file")
file1.open   # reads the file contents into an Array of Hashes through the CSV library
file1.strip  # removes extra hashes from each array index (normally there are fifty per index), to reduce memory consumption
file2.open
file2.compare(file1.storage)  # @storage is the array of hashes from the open method
file2.output
Now what I’m struggling with is the compare method. Working on smaller data sets it’s not a big deal at all, works fast enough. However in this case I’m comparing about 400,000 records (all read into the array of hashes) against one that has about 450,000 records. I’m trying to speed this up. Also I can’t run the strip method on file2. Here is how I’m doing it now:
def compare(x)
  # obviously just a verbose message
  puts "Comparing and leaving behind non matching entries"
  x.each do |row|
    # @storage is the array of hashes
    @storage.each_index do |y|
      if row[@opts[:field]] == @storage[y][@opts[:field]]
        @storage.delete_at(y)
      end
    end
  end
end
Hopefully that makes sense. I know it’s going to be a slow process just because it has to iterate 400,000 rows 440,000 times each. But do you have any other ideas on how to speed it up and possibly reduce memory consumption?
Yikes, that'll be O(n^2) runtime. Nasty.
A better bet would be to use the built in Set class.
Code would look something like:
require 'set'
file1_content = load_file_content_into_array_here("some_file")
file2_content = load_file_content_into_array_here("some_other_file")
file1_set = Set.new(file1_content)
unique_elements = file1_set - file2_content
That assumes that the files themselves have unique content. Should work in the generic case, but may have quirks depending on what your data looks like and how you parse it, but as long as the lines can be compared with == it should help you out.
Using a set will be MUCH faster than doing a nested loop to iterate over the file content.
(and yes, I have actually done this to process files with ~2 million lines, so it should be able to handle your case - eventually. If you're doing heavy data munging, Ruby may not be the best choice of tool though)
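Since the question compares rows on a single field rather than on whole lines, the same idea can be applied to just that field. This is a sketch on made-up toy data: the names file1_rows, file2_rows and the :id field are hypothetical stand-ins for the parsed CSV rows and for @opts[:field].

```ruby
require 'set'

# Toy stand-ins for the two parsed CSV files (hypothetical data).
file1_rows = [{ id: 1, name: 'a' }, { id: 2, name: 'b' }]
file2_rows = [{ id: 2, name: 'b' }, { id: 3, name: 'c' }, { id: 4, name: 'd' }]

field = :id

# Build the lookup once: one pass over file1, then constant-time membership tests on average.
seen = Set.new(file1_rows.map { |row| row[field] })

# Keep only the rows of file2 whose key does not appear in file1.
unmatched = file2_rows.reject { |row| seen.include?(row[field]) }

p unmatched
```

This makes one pass over each file instead of a nested loop, and the Set only holds the key values, not the full rows, which also keeps memory use down.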
Here's a script comparing two ways of doing it: Your original compare() and a new_compare(). The new_compare uses more of the built in Enumerable methods. Since they are implemented in C, they'll be faster.
I created a constant called Test::SIZE to try out the benchmarks with different hash sizes. Results at the bottom. The difference is huge.
require 'benchmark'
class Test
  SIZE = 20000
  attr_accessor :storage
  def initialize
    file1 = []
    SIZE.times { |x| file1 << { :field => x, :foo => x } }
    @storage = file1
    @opts = {}
    @opts[:field] = :field
  end
  def compare(x)
    x.each do |row|
      @storage.each_index do |y|
        if row[@opts[:field]] == @storage[y][@opts[:field]]
          @storage.delete_at(y)
        end
      end
    end
  end
  def new_compare(other)
    other_keys = other.map { |x| x[@opts[:field]] }
    @storage.reject! { |s| other_keys.include? s[@opts[:field]] }
  end
end
storage2 = []
# We'll make 10 of them match
10.times { |x| storage2 << { :field => x, :foo => x } }
# And the rest wont
(Test::SIZE-10).times { |x| storage2 << { :field => x+100000000, :foo => x} }
Benchmark.bm do |b|
b.report("original compare") do
t1 = Test.new
t1.compare(storage2)
end
end
Benchmark.bm do |b|
b.report("new compare") do
t1 = Test.new
t1.new_compare(storage2)
end
end
Results:
Test::SIZE = 500
user system total real
original compare 0.280000 0.000000 0.280000 ( 0.285366)
user system total real
new compare 0.020000 0.000000 0.020000 ( 0.020458)
Test::SIZE = 1000
user system total real
original compare 28.140000 0.110000 28.250000 ( 28.618907)
user system total real
new compare 1.930000 0.010000 1.940000 ( 1.956868)
Test::SIZE = 5000
ruby test.rb
user system total real
original compare 113.100000 0.440000 113.540000 (115.041267)
user system total real
new compare 7.680000 0.020000 7.700000 ( 7.739120)
Test::SIZE = 10000
user system total real
original compare 453.320000 1.760000 455.080000 (460.549246)
user system total real
new compare 30.840000 0.110000 30.950000 ( 31.226218)
