I'm definitely a newbie to Ruby (and using 1.9.1), so any help is appreciated. Everything I've learned about Ruby has been from using Google. I'm trying to compare two arrays of hashes, and due to the sizes it's taking way too long and flirts with running out of memory. Any help would be appreciated.
I have a Class (ParseCSV) with multiple methods (initialize, open, compare, strip, output).
The way I have it working right now is as follows (and this does pass the tests I've written, just using a much smaller data set):
file1 = ParseCSV.new("some_file")
file2 = ParseCSV.new("some_other_file")
file1.open # reads the file contents into an array of hashes through the CSV library
file1.strip # removes extra hashes from each array index; normally there are fifty hashes per index, so this helps reduce memory consumption
file2.open
file2.compare(file1.storage) # @storage is the array of hashes built by the open method
file2.output
Now what I'm struggling with is the compare method. Working on smaller data sets it's not a big deal at all; it works fast enough. However, in this case I'm comparing about 400,000 records (all read into the array of hashes) against a file that has about 450,000 records. I'm trying to speed this up. Also, I can't run the strip method on file2. Here is how I'm doing it now:
def compare(x)
  # obviously just a verbose message
  puts "Comparing and leaving behind non matching entries"
  x.each do |row|
    # @storage is the array of hashes
    @storage.each_index do |y|
      if row[@opts[:field]] == @storage[y][@opts[:field]]
        @storage.delete_at(y)
      end
    end
  end
end
Hopefully that makes sense. I know it's going to be a slow process just because it has to iterate 400,000 rows 450,000 times each. But do you have any other ideas on how to speed it up and possibly reduce memory consumption?
Yikes, that'll be O(n^2) runtime. Nasty.
A better bet would be to use the built in Set class.
Code would look something like:
require 'set'
file1_content = load_file_content_into_array_here("some_file")
file2_content = load_file_content_into_array_here("some_other_file")
file1_set = Set.new(file1_content)
unique_elements = file1_set - file2_content
That assumes that the files themselves have unique content. It should work in the generic case, though it may have quirks depending on what your data looks like and how you parse it; as long as the lines can be compared with == it should help you out.
Using a set will be MUCH faster than doing a nested loop to iterate over the file content.
(and yes, I have actually done this to process files with ~2 million lines, so it should be able to handle your case - eventually. If you're doing heavy data munging, Ruby may not be the best choice of tool though)
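If you need to match on a single field (as in your compare method) rather than on whole rows, the same idea still applies: build a Set of the key values once, then filter. A hedged sketch, reusing the @storage and @opts[:field] names from your class:

require 'set'

def compare(other_rows)
  puts "Comparing and leaving behind non matching entries"
  # One pass to collect the keys, one pass to filter against them
  other_keys = Set.new(other_rows.map { |row| row[@opts[:field]] })
  @storage.reject! { |row| other_keys.include?(row[@opts[:field]]) }
end

This does one pass over each collection instead of a nested loop, so it's roughly O(n + m) rather than O(n*m).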
Here's a script comparing two ways of doing it: Your original compare() and a new_compare(). The new_compare uses more of the built in Enumerable methods. Since they are implemented in C, they'll be faster.
I created a constant called Test::SIZE to try out the benchmarks with different hash sizes. Results at the bottom. The difference is huge.
require 'benchmark'

class Test
  SIZE = 20000

  attr_accessor :storage

  def initialize
    file1 = []
    SIZE.times { |x| file1 << { :field => x, :foo => x } }
    @storage = file1
    @opts = {}
    @opts[:field] = :field
  end

  def compare(x)
    x.each do |row|
      @storage.each_index do |y|
        if row[@opts[:field]] == @storage[y][@opts[:field]]
          @storage.delete_at(y)
        end
      end
    end
  end

  def new_compare(other)
    other_keys = other.map { |x| x[@opts[:field]] }
    @storage.reject! { |s| other_keys.include? s[@opts[:field]] }
  end
end
storage2 = []
# We'll make 10 of them match
10.times { |x| storage2 << { :field => x, :foo => x } }
# And the rest won't
(Test::SIZE-10).times { |x| storage2 << { :field => x+100000000, :foo => x} }
Benchmark.bm do |b|
b.report("original compare") do
t1 = Test.new
t1.compare(storage2)
end
end
Benchmark.bm do |b|
b.report("new compare") do
t1 = Test.new
t1.new_compare(storage2)
end
end
Results:
Test::SIZE = 500
user system total real
original compare 0.280000 0.000000 0.280000 ( 0.285366)
user system total real
new compare 0.020000 0.000000 0.020000 ( 0.020458)
Test::SIZE = 1000
user system total real
original compare 28.140000 0.110000 28.250000 ( 28.618907)
user system total real
new compare 1.930000 0.010000 1.940000 ( 1.956868)
Test::SIZE = 5000
user system total real
original compare 113.100000 0.440000 113.540000 (115.041267)
user system total real
new compare 7.680000 0.020000 7.700000 ( 7.739120)
Test::SIZE = 10000
user system total real
original compare 453.320000 1.760000 455.080000 (460.549246)
user system total real
new compare 30.840000 0.110000 30.950000 ( 31.226218)
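One caveat about new_compare: other_keys is still an Array, so each include? call scans it linearly, and the overall work remains quadratic, just with a much smaller constant. A possible further refinement (a sketch, assuming the same Test class as above) is to put the keys in a Set:

require 'set'

def set_compare(other)
  # Set#include? is a hash lookup, not a linear scan
  other_keys = Set.new(other.map { |x| x[@opts[:field]] })
  @storage.reject! { |s| other_keys.include? s[@opts[:field]] }
end

With that change the work scales roughly linearly with the input sizes.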
Related
This method is taking over 7 seconds with 50 markets and 2,500 flows (~250,000 iterations). Why so slow?
def matrix
  [:origin, :destination].collect do |location|
    markets.collect do |market|
      network.flows.collect { |flow| flow[location] == market ? 1 : 0 }
    end
  end.flatten
end
I know that the slowness comes from the comparison of one market to another market based on benchmarks that I've run.
Here are the relevant parts of the class that's being compared.
module FreightFlow
  class Market
    include ActiveAttr::Model

    attribute :coordinates

    def ==(value)
      coordinates == value.coordinates
    end
  end
end
What's the best way to make this faster?
You are constructing 100 intermediate collections (2*50) comprising a total of 250,000 (2*50*2500) elements, and then flattening it at the end. I would try constructing the whole data structure in one pass. Make sure that markets and network.flows are stored in a hash or set. Maybe something like:
def matrix
  network.flows.collect do |flow|
    (markets.has_key? flow[:origin] or
     markets.has_key? flow[:destination]) ? 1 : 0
  end
end
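If markets is currently an Array of Market objects, one hypothetical way to get the hash/set-style lookups assumed above is to index the markets by their coordinates once (this assumes coordinates is usable as a hash key, and that flow[:origin] holds a Market):

require 'set'

# Build the index once, outside the loop
market_keys = Set.new(markets.map(&:coordinates))
# A membership test is then a single hash lookup:
market_keys.include?(flow[:origin].coordinates)

Whether this fits depends on what flow[:origin] actually stores, so treat it as a sketch.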
This is a simple thing but it can help...
In your innermost loop you're doing:
network.flows.collect { |flow| flow[location] == market ? 1 : 0 }
Instead of using the ternary statement to convert to 1 or 0, use true and false Booleans instead:
network.flows.collect { |flow| flow[location] == market }
This isn't a big difference in speed, but over the course of that many nested loops it adds up.
In addition, it allows you to simplify your tests using the matrix being generated. Instead of having to compare to 1 or 0, you can simplify your conditional tests to if flow[location], if !flow[location] or unless flow[location], again speeding up your application a little bit for each test. If those are deeply nested in loops, which is very likely, that little bit can add up again.
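For instance, counting matches in the generated matrix (hypothetical usage):

# 1/0 version forces an extra comparison per cell:
matches = matrix.count { |cell| cell == 1 }
# Boolean version can test each cell directly:
matches = matrix.count { |cell| cell }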
Something that is important to do, when speed is important, is use Ruby's Benchmark class to test various ways of doing the same task. Then, instead of guessing, you KNOW what works. You'll find lots of questions on Stack Overflow where I've supplied an answer that consists of a benchmark showing the speed differences between various ways of doing something. Sometimes the differences are very big. For instance:
require 'benchmark'

puts `ruby -v`

def test1()
  true
end

def test2(p1)
  true
end

def test3(p1, p2)
  true
end

N = 10_000_000

Benchmark.bm(5) do |b|
  b.report('?:') { N.times { (1 == 1) ? 1 : 0 } }
  b.report('==') { N.times { (1 == 1) } }
  b.report('if') {
    N.times {
      if (1 == 1)
        1
      else
        0
      end
    }
  }
end

Benchmark.bm(5) do |b|
  b.report('test1') { N.times { test1() } }
  b.report('test2') { N.times { test2('foo') } }
  b.report('test3') { N.times { test3('foo', 'bar') } }
  b.report('test4') { N.times { true } }
end
And the results:
ruby 1.9.3p392 (2013-02-22 revision 39386) [x86_64-darwin10.8.0]
user system total real
?: 1.880000 0.000000 1.880000 ( 1.878676)
== 1.780000 0.000000 1.780000 ( 1.785718)
if 1.920000 0.000000 1.920000 ( 1.914225)
user system total real
test1 2.760000 0.000000 2.760000 ( 2.760861)
test2 4.800000 0.000000 4.800000 ( 4.808184)
test3 6.920000 0.000000 6.920000 ( 6.915318)
test4 1.640000 0.000000 1.640000 ( 1.637506)
ruby 2.0.0p0 (2013-02-24 revision 39474) [x86_64-darwin10.8.0]
user system total real
?: 2.280000 0.000000 2.280000 ( 2.285408)
== 2.090000 0.010000 2.100000 ( 2.087504)
if 2.350000 0.000000 2.350000 ( 2.363972)
user system total real
test1 2.900000 0.010000 2.910000 ( 2.899922)
test2 7.070000 0.010000 7.080000 ( 7.092513)
test3 11.010000 0.030000 11.040000 ( 11.033432)
test4 1.660000 0.000000 1.660000 ( 1.667247)
There are two different sets of tests. The first is looking to see what the differences are with simple conditional tests vs. using == without a ternary to get just the Booleans. The second is to test the effect of calling a method, a method with a single parameter, and with two parameters, vs. "inline-code" to find out the cost of the setup and tear-down when calling a method.
Modern C compilers do some amazing things when they analyze the code prior to emitting the assembly language to be compiled. We can fine-tune them to write for size or speed. When we go for speed, the program grows as the compiler looks for loops it can unroll and places it can "inline" code, to avoid making the CPU jump around and throwing away stuff that's in the cache.
Ruby is much higher up the language chain, but some of the same ideas still apply. We can write in a very DRY manner, and avoid repetition and use methods and classes to abstract our data, but the cost is reduced processing speed. The answer is to write your code intelligently and don't waste CPU time and unroll/inline where necessary to gain speed and other times be DRY to make your code more maintainable.
It's all a balancing act, and there's a time for writing both ways.
Memoizing the indexes of the markets within the flows was way faster than any other solution. Time reduced from ~30 seconds when the question was asked to 0.6 seconds.
First, I added a flow_index in the Network class. It stores the indexes of the flows that contain the markets.
def flow_index
  @flow_index ||= begin
    flow_index = {}
    [:origin, :destination].each do |location|
      flow_index[location] = {}
      flows.each { |flow| flow_index[location][flow[location]] = [] }
      flows.each_with_index { |flow, i| flow_index[location][flow[location]] << i }
    end
    flow_index
  end
end
Then, I refactored the matrix method to use the flow index.
def matrix
  base_row = network.flows.count.times.collect { 0 }
  [:origin, :destination].collect do |location|
    markets.collect do |market|
      row = base_row.dup
      network.flow_index[location][market].each do |i|
        row[i] = 1
      end
      row
    end
  end.flatten
end
The base_row is created with all 0s and you just replace with 1s at the locations from the flow_index for that market.
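To make the index shape concrete, here's a small hypothetical example of what flow_index would memoize:

flows = [ { :origin => 'NYC', :destination => 'LAX' },
          { :origin => 'NYC', :destination => 'ORD' } ]
# flow_index would then hold:
# { :origin      => { 'NYC' => [0, 1] },
#   :destination => { 'LAX' => [0], 'ORD' => [1] } }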
When I read a text file into memory it brings my text in with '\n' at the end due to the new lines.
["Hello\n", "my\n", "name\n", "is\n", "John\n"]
Here is how I am reading the text file
array = File.readlines('text_file.txt')
I need to do a lot of processing on this text array, so I'm wondering whether, performance-wise, I should remove the "\n" when I first create the array, or later when I do the processing on each element with a regex.
I wrote some (admittedly bad) test code to remove the "\n"
array = []
File.open('text_file.txt', "r").each_line do |line|
  data = line.split(/\n/)
  array << data
end
array.flatten!
Is there a better way to do this if I should remove the "\n" when I first create the array?
If I wanted to read the file into a Set instead (for performance), is there a method similar to readlines to do that?
You need to run a benchmark test, using Ruby's built-in Benchmark to figure out what is your fastest choice.
However, from experience, I've found that "slurping" the file, i.e., reading it all in at once, is not any faster than using a loop with IO.foreach or File.foreach. This is because Ruby and the underlying OS do file buffering as the reads occur, allowing your loop to occur from memory, not directly from disk. foreach will not strip the line-terminators for you, like split would, so you'll need to add a chomp or chomp! if you want to mutate the line read in:
File.foreach('/path/to/file') do |li|
  puts li.chomp
end
or
File.foreach('/path/to/file') do |li|
  li.chomp!
  puts li
end
Also, slurping has the problem of not being scalable: you could end up trying to read a file bigger than memory, bringing your machine to its knees, while reading line-by-line will never do that.
Here's some performance numbers:
#!/usr/bin/env ruby
require 'benchmark'
require 'fileutils'
FILENAME = 'test.txt'
LOOPS = 1
puts "Ruby Version: #{RUBY_VERSION}"
puts "Filesize being read: #{File.size(FILENAME)}"
puts "Lines in file: #{`wc -l #{FILENAME}`.split.first}"
Benchmark.bm(20) do |x|
  x.report('read.split')           { LOOPS.times { File.read(FILENAME).split("\n") }}
  x.report('read.lines.chomp')     { LOOPS.times { File.read(FILENAME).lines.map(&:chomp) }}
  x.report('readlines.map.chomp1') { LOOPS.times { File.readlines(FILENAME).map(&:chomp) }}
  x.report('readlines.map.chomp2') { LOOPS.times { File.readlines(FILENAME).map{ |s| s.chomp } }}
  x.report('foreach.map.chomp1')   { LOOPS.times { File.foreach(FILENAME).map(&:chomp) }}
  x.report('foreach.map.chomp2')   { LOOPS.times { File.foreach(FILENAME).map{ |s| s.chomp } }}
end
And the results:
Ruby Version: 1.9.3
Filesize being read: 42026131
Lines in file: 465440
user system total real
read.split 0.150000 0.060000 0.210000 ( 0.213365)
read.lines.chomp 0.470000 0.070000 0.540000 ( 0.541266)
readlines.map.chomp1 0.450000 0.090000 0.540000 ( 0.535465)
readlines.map.chomp2 0.550000 0.060000 0.610000 ( 0.616674)
foreach.map.chomp1 0.580000 0.060000 0.640000 ( 0.641563)
foreach.map.chomp2 0.620000 0.050000 0.670000 ( 0.662912)
On today's machines a 42MB file can be read into RAM pretty safely. I have seen files a lot bigger than that which won't fit into the memory of some of our production hosts. While foreach is slower, it also won't bring a machine to its knees by sucking up all memory if there isn't enough to go around.
On Ruby 1.9.3, using the map(&:chomp) method, instead of the older form of map { |s| s.chomp }, is a lot faster. That wasn't true with older versions of Ruby, so caveat emptor.
Also, note that all the above processed the data in less than one second on my several years old Mac Pro. All in all I'd say that worrying about the load speed is premature optimization, and the real problem will be what is done after the data is loaded.
I'd use String#chomp:
lines = open('text_file.txt').lines.map(&:chomp)
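As for reading straight into a Set (the last part of the question): as far as I know there's no readlines-style one-call method for that, but a small sketch gets you there:

require 'set'

lines = Set.new
File.foreach('text_file.txt') { |line| lines << line.chomp }

Membership tests such as lines.include?('John') are then hash lookups instead of array scans.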
If you want to get rid of the ending newline character you can use either String#chomp or String#rstrip. My preferred method would be chomp.
So you can easily do something like:
lines.map! { |line| line.chomp }
# or
lines.map! { |line| line.rstrip }
mvelez@argo:~$ cat test.txt
Hello
my
name
is
John
One liner:
arr = File.open("test.txt",'r').read.split
Decomposing this in irb
irb(main):002:0> f = File.open("test.txt",'r')
=> #<File:test.txt>
irb(main):003:0> file_contents = f.read
=> "Hello\nmy\nname\nis\nJohn\n\n"
irb(main):004:0> file_contents.split
=> ["Hello", "my", "name", "is", "John"]
I'd prefer using strip over split in these cases, and doing it right when each line is first handled. Using split after reading a line is overkill, IMO. So the code snippet would be
array = []
File.open('text_file.txt', "r").each_line do |line|
  array << line.strip
end
I'm writing an XML data scanner which reads XML text using an XML parser library such as Nokogiri and generates a tree of nodes. I need to create one object per XML element. So, I need a method which creates an object according to the given element name and attributes, like this, regardless of which kind of parser library (either SAX or DOM) I'm using:
create_node(name, attributes_hash)
This method needs to branch according to the name. The implementation possibilities are:
Case statement
Method dispatch and pre-defined methods
Since this method could become a bottleneck, I wrote a benchmark script to check how Ruby performs. (The benchmark script is attached at the end of this question. I don't like some parts of it -- particularly how the case statement is created -- so comments on how to improve that are also welcome, but please provide them as comments, not an answer. I probably need to create a separate question for that too.)
The script measures following four cases, in two range sizes:
method dispatch with constant name
method dispatch with name concatenate with #{}
method dispatch with name concatenate with +
using case statement, call the same methods
Results:
user system total real
a to z: method_calls (with const name) 0.090000 0.000000 0.090000 ( 0.092516)
a to z: method_calls (with dynamic name) 1 0.180000 0.000000 0.180000 ( 0.181793)
a to z: method_calls (with dynamic name) 2 0.200000 0.000000 0.200000 ( 0.202818)
a to z: switch_calls 0.130000 0.000000 0.130000 ( 0.132633)
user system total real
a to zz: method_calls (with const name) 2.900000 0.000000 2.900000 ( 2.894273)
a to zz: method_calls (with dynamic name) 1 6.500000 0.010000 6.510000 ( 6.507099)
a to zz: method_calls (with dynamic name) 2 6.980000 0.000000 6.980000 ( 6.987534)
a to zz: switch_calls 4.750000 0.000000 4.750000 ( 4.742448)
I observe that constant-name method dispatch is faster than using a case statement. However, if a string operation is involved in determining the method name, the cost of computing the name exceeds the cost of the actual method call, effectively making options 2 and 3 slower than option 4. Also, the difference between options 2 and 3 is negligible.
To make the scanner secure, I prefer to have some prefix on the method names, since without one it would be possible to craft an XML document that invokes methods I don't want invoked. But the cost of determining the method name is not negligible.
How do you write such scanners? I want answers to the following questions:
Is there any good scheme other than above?
If not, which (case-when or method dispatch) scheme you choose?
Not computing the method name is faster. Is there a good way to do method dispatch securely (for example, by limiting which node names can be dispatched)?
The benchmark script
# Benchmark to measure the difference between
# use of a case statement and message passing
require 'benchmark'

def bench(title, tobj, count)
  Benchmark.bmbm do |b|
    b.report "#{title}: method_calls (with const name)" do
      (1..count).each do |c|
        tobj.run_send_using_const
      end
    end
    b.report "#{title}: method_calls (with dynamic name) 1" do
      (1..count).each do |c|
        tobj.run_send_using_dynamic_1
      end
    end
    b.report "#{title}: method_calls (with dynamic name) 2" do
      (1..count).each do |c|
        tobj.run_send_using_dynamic_2
      end
    end
    b.report "#{title}: switch_calls" do
      (1..count).each do |c|
        tobj.run_switch
      end
    end
  end
end

class Switcher
  def initialize(names)
    @method_names = { }
    @names = names
    names.each do |n|
      @method_names[n] = "dynamic_#{n}"
      @@n = n
      class << self
        mname = "dynamic_#{@@n}"
        define_method(mname) do
          mname
        end
      end
    end
    swst = ""
    names.each do |n|
      swst << "when \"#{n}\" then dynamic_#{n}\n"
    end
    st = "
      def run_switch_each(n)
        case n
        #{swst}
        end
      end
    "
    eval(st)
  end

  def run_send_using_const
    @method_names.each_value do |n|
      self.send n
    end
  end

  def run_send_using_dynamic_1
    @names.each do |n|
      self.send "dynamic_#{n}"
    end
  end

  def run_send_using_dynamic_2
    @names.each do |n|
      self.send "dynamic_" + n
    end
  end

  def run_switch
    @names.each do |n|
      run_switch_each(n)
    end
  end
end

sw1 = Switcher.new('a'..'z')
sw2 = Switcher.new('a'..'zz')

bench("a to z", sw1, 10000)
bench("a to zz", sw2, 10000)
I believe this is a case of premature optimization.
But the cost to determine the method name is not negligible.
Non-negligible compared to what? The approaches here have different performance numbers, but will the time taken to dispatch one node be comparable to the time it takes to parse the node (with Nokogiri etc.), construct the specialized node object, and do whatever you need with it?
I believe it won't. I don't have a benchmark to prove that statement (you need actual code for that), but the fact that string concatenation vs string interpolation actually makes a noticeable difference in the results (dynamic1 vs dynamic2) is a good indicator that you're measuring something trivial.
Or that adding one string concatenation per dispatch increases the resulting time 2-2.5 times (const vs dynamic2).
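If you do want the security property without paying for string building on every dispatch, one hedged option is to precompute a whitelist table once at startup. The element names and the create_*_node handlers below are made up for illustration:

# Hypothetical dispatch table: method names are computed once,
# so per-node dispatch is a single Hash lookup, and element names
# outside the whitelist can never reach arbitrary methods.
ALLOWED = %w[book author title].freeze
DISPATCH = ALLOWED.each_with_object({}) do |name, table|
  table[name] = :"create_#{name}_node"
end.freeze

def create_node(name, attributes_hash)
  handler = DISPATCH[name] or raise "unknown element: #{name}"
  send(handler, attributes_hash)
end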
Since I'm doing some time measurements at the moment, I wondered if it is possible to measure the user time or system time without using the Benchmark class or the command line utility time.
Using the Time class only reveals the wall clock time, not system and user time; however, I'm looking for a solution with the same flexibility, e.g.
time = TimeUtility.now
# some code
user, system, real = TimeUtility.now - time
The reason is that I somehow dislike Benchmark, since it cannot return numbers only (EDIT: I was wrong - it can. See answers below.). Sure, I could parse the output, but that doesn't feel right. The time utility from *NIX systems would solve my problem as well, but I wanted to know whether there is already some kind of wrapper implemented in Ruby so I don't need to make these system calls myself.
Thanks a lot!
I re-read the Benchmark documentation and saw that it has a method named measure. This method does exactly what I want: measure the time your code needs and return an object containing user time, system time, system time of children, etc. It is as easy as
require 'benchmark'
measurement = Benchmark.measure do
# your code goes here
end
In the process I found out that you can add custom rows to the Benchmark output. You can use this to get the best of both worlds (custom time measurements and a nice output at the end) as follows:
require 'benchmark'
measurements = []
10.times { measurements << Benchmark.measure { 1_000_000.times { a = "1" } } }
# measurements.sum or measurements.inject(0){...} does not work, since the
# array contains Benchmark::Tms instances, which cannot be coerced into Fixnums
# Array#sum will work if you are using Rails
sum = measurements.inject(nil) { |sum, t| sum.nil? ? sum = t : sum += t }
avg = sum / measurements.size
# 7 is the width reserved for the description "sum:" and "avg:"
Benchmark.bm(7, "sum:", "avg:") do |b|
  [sum, avg]
end
The result will look like the following:
user system total real
sum: 2.700000 0.000000 2.700000 ( 2.706234)
avg: 0.270000 0.000000 0.270000 ( 0.270623)
You could use the Process::times function, which returns user time/system time. (It does not report wall clock time, you'll need something else for that). Seems to be a bit version or OS dependent though.
This is what it reports on my system (linux, ruby 1.8.7):
$ irb
irb(main):001:0> t = Process.times
=> #<struct Struct::Tms utime=0.01, stime=0.0, cutime=0.0, cstime=0.0>
The docs show this though, so some versions/implementations might only have the first two:
t = Process.times
[ t.utime, t.stime ] #=> [0.0, 0.02]
See the times() man page for the underlying call on Linux.
Here's a really crappy wrapper that kind of supports subtraction with -:
class SysTimes
  attr_accessor :user, :system

  def initialize
    times = Process.times
    @user = times.utime
    @system = times.stime
  end

  def -(other)
    diff = SysTimes.new
    diff.user = @user - other.user
    diff.system = @system - other.system
    diff
  end
end
Should give you ideas to make it work nicely in your context.
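Hypothetical usage, mirroring the API sketched in the question:

before = SysTimes.new
# some code
delta = SysTimes.new - before
puts "user: #{delta.user}, system: #{delta.system}"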
This gem might help:
https://github.com/igorkasyanchuk/benchmark_methods
No more code like this:
t = Time.now
user.calculate_report
puts Time.now - t
Now you can do:
benchmark :calculate_report # in class
And just call your method
user.calculate_report
Which of these two forms of Array Initialization is better in Ruby?
Method 1:
DAYS_IN_A_WEEK = (0..6).to_a
HOURS_IN_A_DAY = (0..23).to_a

@data = Array.new(DAYS_IN_A_WEEK.size).map! { Array.new(HOURS_IN_A_DAY.size) }

DAYS_IN_A_WEEK.each do |day|
  HOURS_IN_A_DAY.each do |hour|
    @data[day][hour] = 'something'
  end
end
Method 2:
DAYS_IN_A_WEEK = (0..6).to_a
HOURS_IN_A_DAY = (0..23).to_a

@data = {}

DAYS_IN_A_WEEK.each do |day|
  HOURS_IN_A_DAY.each do |hour|
    @data[day] ||= {}
    @data[day][hour] = 'something'
  end
end
The difference between the first method and the second is that the second does not allocate memory up front. I feel the second one is a bit inferior performance-wise due to the number of Array copies that have to happen.
However, it is not straightforward in Ruby to see what is happening, so if someone can explain to me which is better, that would be really great!
Thanks
Before I answer the question you asked, I'm going to answer the question you should have asked but didn't:
Q: Should I focus on making my code readable first, or should I focus on performance first?
A: Make your code readable and correct first, then, and only if there is a performance problem, start to worry about performance by measuring where the performance problem is first and only then making changes to your code.
Now to answer the question you asked, but shouldn't have:
method1.rb:
DAYS_IN_A_WEEK = (0..6).to_a
HOURS_IN_A_DAY = (0..23).to_a

10000.times do
  @data = Array.new(DAYS_IN_A_WEEK.size).map! { Array.new(HOURS_IN_A_DAY.size) }
  DAYS_IN_A_WEEK.each do |day|
    HOURS_IN_A_DAY.each do |hour|
      @data[day][hour] = 'something'
    end
  end
end
method2.rb:
DAYS_IN_A_WEEK = (0..6).to_a
HOURS_IN_A_DAY = (0..23).to_a

10000.times do
  @data = {}
  DAYS_IN_A_WEEK.each do |day|
    HOURS_IN_A_DAY.each do |hour|
      @data[day] ||= {}
      @data[day][hour] = 'something'
    end
  end
end
Results of brain-dead benchmark:
$ time ruby method1.rb
real 0m1.189s
user 0m1.140s
sys 0m0.000s
$ time ruby method2.rb
real 0m1.879s
user 0m1.780s
sys 0m0.020s
Looks to me like user time usage (the important factor) has method1.rb a lot faster. You, of course, should not trust this benchmark and should make your own reflecting your actual code use. This, however, is something you should do only after you have determined which code is your performance bottleneck in reality. (Hint: 99.44% of computer programmers are 100% wrong when they guess where their bottlenecks are without measuring!)
What's wrong with just
@data = Array.new(7) { Array.new(24) { 'something' }}
Or, if you are content having the same object everywhere:
@data = Array.new(7) { Array.new(24, 'something') }
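The difference between the two: the block form builds a distinct String per slot, while the default-value form shares a single object across all 24 slots, which matters if you ever mutate an entry. A quick illustration (hypothetical check):

with_block   = Array.new(24) { 'something' }
with_default = Array.new(24, 'something')
with_block[0].equal?(with_block[1])     # => false, distinct objects
with_default[0].equal?(with_default[1]) # => true, one shared object
with_default[0] << '!'                  # mutates the shared string
with_default[1]                         # => "something!"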
It's much faster, not that it would matter. It is also much more readable, which is the most important thing. After all, the purpose of code is communicating intent to the other stakeholders, not communicating with the computer.

              user     system      total        real
method1   8.969000   0.000000   8.969000  ( 9.059570)
method2  16.547000   0.000000  16.547000  (16.799805)
method3   6.468000   0.000000   6.468000  ( 6.616211)
method4   0.969000   0.015000   0.984000  ( 1.021484)

That last line also shows another interesting thing: the runtime is dominated by the time needed to create the 7*24*100000 = 16.8 million 'something' strings.
And of course there is another important obversation: your method1 and method2 that you are comparing against each other do two completely different things! It doesn't even make sense to compare them against each other. method1 creates an Array, method2 creates a Hash.
Your method1 is equivalent to my first example above:
@data = Array.new(7) { Array.new(24) { 'something' }}
While method2 is (very roughly) equivalent to:
@data = Hash.new {|h, k| h[k] = Hash.new {|h, k| h[k] = 'something' }}
Well, except that your method2 initializes the entire Hash eagerly, while my method only executes the initialization code lazily in case an uninitialized key is read.
In other words, after running the above initialization code, the Hash is still empty:
@data # => {}
But whenever you try to access a key, it will magically appear:
@data[5][17] # => 'something'
And it will stay there:
@data # => {5 => {17 => 'something'}}
Since this code doesn't actually initialize the Hash, it is obviously way faster:

              user     system      total        real
method5   0.266000   0.000000   0.266000  ( 0.296875)
I wrapped both of the code snippets into separate methods and did some benchmarking. Here are the results:
Benchmark.bm(7) do |x|
  x.report("method1") { 100000.times { method1 } }
  x.report("method2") { 100000.times { method2 } }
end
user system total real
method1 11.370000 0.010000 11.380000 ( 11.392233)
method2 17.920000 0.010000 17.930000 ( 18.328318)