Performance of Sets V.S. Arrays in Ruby - ruby

In Ruby, I am building a method which constructs and returns a (probably large) array which should contain no duplicate elements. Would I get better performance by using a set and then converting that to an array? Or would it be better to just call .uniq on the array I am using before I return it? Or what about using & to append items to the array instead of +=? And if I do use a set, would not having a <=> method on the object I am putting into the set have an effect on performance? (If you're not sure, do you know of a way to test this?)

The real answer is: write the most readable and maintainable code, and optimize it only after you've shown it is a bottleneck. If you can find an algorithm that runs in linear time, you won't have to optimize it. Here it's easy to find...
Not quite sure which methods you are suggesting, but using my fruity gem:
require 'fruity'
require 'set'
enum = 1000.times
compare do
  uniq { enum.each_with_object([]) { |x, array| array << x }.uniq }
  set  { enum.each_with_object(Set[]) { |x, set| set << x }.to_a }
  join { enum.inject([]) { |array, x| array | [x] } }
end
# set is faster than uniq by 10.0% ± 1.0%
# uniq is faster than join by 394x ± 10.0
Clearly, it makes no sense building intermediate arrays like in the third method. Otherwise, it's not going to make a big difference since you will be in O(n); that's the main thing.
BTW, sets, uniq and Array#| all use eql? and hash on your objects, not <=>. These need to be defined in a sane manner, because by default two objects are eql? only if they are the same object, i.e. have the same object_id (see this question).
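As a rough sketch of what "defined in a sane manner" could look like, here is a small value class that overrides eql? and hash so that uniq, Set and Array#| all treat equal coordinates as duplicates. The Point class and its fields are hypothetical, made up for this illustration only.

```ruby
require 'set'

# Hypothetical value class, used only for illustration.
class Point
  attr_reader :x, :y

  def initialize(x, y)
    @x = x
    @y = y
  end

  # Two points count as duplicates when their coordinates match.
  def eql?(other)
    other.is_a?(Point) && x == other.x && y == other.y
  end
  alias_method :==, :eql?

  # Objects that are eql? must return the same hash value.
  def hash
    [x, y].hash
  end
end

points = [Point.new(1, 2), Point.new(1, 2), Point.new(3, 4)]

unique_via_uniq = points.uniq
unique_via_set  = Set.new(points).to_a
unique_via_pipe = points | []
```

Without the two overrides, all three results would keep every Point, since each instance is a distinct object.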

Have you tried using the Benchmark library? Tests are usually very easy to construct and will properly reflect how it works in your particular version of Ruby.
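A test along those lines might look something like the sketch below. The input size and the labels are arbitrary assumptions; the timings will differ per machine and Ruby version, so no results are quoted here.

```ruby
require 'benchmark'
require 'set'

# Sample input with many duplicates; the sizes are arbitrary assumptions.
values = Array.new(10_000) { |i| i % 1_000 }

via_uniq = nil
via_set  = nil

Benchmark.bm(6) do |x|
  # Build a plain array, then de-duplicate once at the end.
  x.report('uniq') { via_uniq = values.each_with_object([]) { |v, acc| acc << v }.uniq }
  # Build a Set as we go, then convert it to an array.
  x.report('set')  { via_set = values.each_with_object(Set.new) { |v, acc| acc << v }.to_a }
end
```

Swapping in your own objects for values shows how your particular eql? and hash implementations affect each approach.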

Related

Is there a benefit to using <=> rather than just sorting and reversing?

What are the benefits (if any) of doing:
books.sort! { |firstBook, secondBook| secondBook <=> firstBook }
versus:
books.sort!.reverse!
The second option just seems so much cleaner and easier to understand..
edit: I guess this might be a question of, what are other uses of the <=> operator other than 1-to-1 sorting?
My initial answer about performance concerns has proven to be largely based on an incorrect assumption: There is no performance impact inherent in sort.reverse, as sort with no block appears to be faster than sort with a block, so much so that it offsets the cost of a second reverse call, which is negligible.
However, the gist of my answer remains valid: You should choose the second line because it is more readable, and worry about finding out which is the faster option when you identify a performance problem.
Original answer follows:
The second option is more expensive. It sorts everything in ascending order and then reverses the array, two distinct processes, while the first option produces the array in descending order immediately.
That said, the second option is the one I'd prefer. Generally, prefer producing readable, maintainable code over prematurely optimizing for performance.
Obviously you have to ask yourself: "Does this code run many times per second?" or "Does this code run once in the lifetime of the app?" and your priorities will change accordingly, but generally, maintainability trumps performance.
Use the second option until you can prove that it's a performance bottleneck.
I was pretty surprised, but it seems that the second option is faster, and not just by a small percentage.
require 'benchmark'
array = (1..10**7).to_a.shuffle
Benchmark.bm do |x|
  x.report { array.sort { |firstBook, secondBook| secondBook <=> firstBook } }
  x.report { array.sort.reverse }
end
Results:
       user     system      total        real
  21.090000   0.030000  21.120000 ( 21.135562)
   2.060000   0.020000   2.080000 (  2.098318)
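On the edit in the question about other uses of <=>: one common use is defining it on your own class and mixing in Comparable, which then supplies <, >, between? and friends, and lets sort, min and max work without a block. The sketch below assumes a hypothetical Book class ordered by year; none of it comes from the original posts.

```ruby
# Hypothetical Book class, used only for illustration.
class Book
  include Comparable

  attr_reader :title, :year

  def initialize(title, year)
    @title = title
    @year = year
  end

  # Order books by publication year.
  def <=>(other)
    year <=> other.year
  end
end

old_book = Book.new('Older', 1990)
new_book = Book.new('Newer', 2010)
books = [new_book, old_book]

ascending  = books.sort                      # uses Book#<=>
descending = books.sort.reverse              # the readable form
via_block  = books.sort { |a, b| b <=> a }   # the block form
newest     = books.max
```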

Various ways of creating Objects in Ruby

Are these methods of creating an empty Ruby Hash different? If so how?
myHash = Hash.new
myHash = {}
I'd just like a solid understanding of memory management in Ruby.
There are many ways you can create a Hash object in Ruby, though the end result is the same sort of object:
hash = { }
hash = Hash.new
hash = Hash[]
hash = some_object.to_h
hash = YAML.load("--- {}\n\n")
As far as memory considerations go, an empty Hash is significantly smaller than one with even a single value in it. Arrays tend to be smaller than Hashes at small sizes, but will be more efficient at larger scales.
In practice, though, the important thing to remember in Ruby is that every time you create an object it costs you something, even if it's only an infinitesimal amount. These little hits add up if you're needlessly creating billions of objects.
Generally you should avoid creating structures that will not be used, and instead create them on demand if that wouldn't complicate things needlessly. For example, a typical pattern is:
def cache
  @cache ||= { }
end
Until this method is called, the cache Hash is never created. The memory savings in this instance are nearly insignificant, but if that was loading a large configuration file or importing several hundred MB of data from a database, you can imagine the savings would be significant in those instances where that data is not exercised.
The two methods are exactly equivalent.
As is mentioned above, the two are operationally equivalent. If you're referring to the standard MRI / YARV, perhaps this thread would help: http://www.ruby-forum.com/topic/215163#new.
With the Hash.new syntax you can specify what to do when some key is absent in the hash (the default behaviour). With the {} syntax it takes another step.
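A short sketch of that difference, with variable names chosen only for illustration:

```ruby
# Default supplied at construction time.
counts = Hash.new(0)
counts[:apple] += 1

# The literal syntax needs a separate step to set the default.
other_counts = {}
other_counts.default = 0
other_counts[:apple] += 1

# Block form: runs when a key is missing, here storing a fresh array per key.
groups = Hash.new { |hash, key| hash[key] = [] }
groups[:fruit] << 'apple'

# Reading a missing key returns the default without storing it.
missing_count = counts[:pear]
```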
my_hash = Hash[]
is another way of creating a hash; the [] method takes an even number of arguments.
my_hash = Hash[:a, 1, :b, 2]
This has nothing to do with memory management.

Performance using the Ruby 'include?' method

I would like to know how much the include? method can affect the performance if I have something like this:
array = [<array_values>] # Read above for more information
(0..<n_iterations>).each { |value| # Read above for more information
array.include?(value)
}
In the cases where <array_values> has 10, 100 and 1,000 elements and <n_iterations> is 10, 100 and 1,000.
Use a Set (or equivalently a Hash) instead of an array, so that include? is O(1) instead of O(n).
Or if you have multiple include? checks to do, you can use array intersection & or subtraction -, which will build a temporary Hash to do the operation efficiently.
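Both suggestions can be sketched roughly as follows. The data sizes and variable names are assumptions made for the example, not values from the question.

```ruby
require 'set'

array  = (0...1_000).to_a   # sample data; the size is an assumption
lookup = Set.new(array)     # built once, then reused for every check

# Each include? on the Set is a hash lookup rather than a scan of the array.
hits = (0..2_000).count { |value| lookup.include?(value) }

# Several membership tests at once via intersection and difference.
candidates = [5, 999, 1_500]
present = candidates & array
missing = candidates - array
```

Building the Set costs one pass over the array, so it only pays off when you run more than a handful of lookups against it.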
I think ruby-prof might be a good place to start. However, that performance data won't be useful without something else to compare this to. As in, "is the performance of this method better or worse than [some other method]?"
Also note that once n_iterations grows larger than the size of array, the following code will probably perform better, because it makes fewer #include? calls:
array.each do |value|
  (0..<n_iterations>).map.include?(value)
end

Ruby Loops Question

C++:
for(i=0,j=0;i<0;i++,j++)
What's the equivalent of this in Ruby?
Besides the normal for and while loops seen in C++, can someone name the other special loops Ruby has, such as .times and .each?
Thanks in advance.
If I understand your question (at least the first part of it), you are wondering how you can iterate two separate variables at the same time, such as i and j.
You can do that in Ruby using the for loop, with multiple variables. For instance, if you wanted i to count up from 1 to 10, and j to count from 10 to 20, you could do:
for i, j in (1..10).zip(10..20)
  puts "#{i}, #{j}"
end
zip will produce, from two arrays, a single array of which each element is an array, with the first element taken from the corresponding position in the first array, and the second element taken from the corresponding position in the second array:
> [1, 2, 3].zip([4, 5, 6])
=> [[1, 4], [2, 5], [3, 6]]
And using i, j in your for loop will take i from the first element of each inner array, and j from the second element.
If you'd rather use each than for, you can just use a block with two parameters:
(1..10).zip(10..20).each { |i, j| puts "#{i}, #{j}" }
As to the second part of your question, Ruby doesn't really have a fixed number of different iterators, since most iteration is done by passing a block to a method, and thus any class can define its own methods that allow iterating over its own contents. The most common is each, and any class that defines an each method can mix in the Enumerable module, which gives you a variety of different methods for iterating over elements, selecting elements, filtering, and so on. There are also times, upto, and downto defined on the Integer class, each_key, each_value, each_pair on Hash, each_byte, each_char, each_line on String, and so on. Just about any class that defines some sort of collection or sequence has methods for iterating over said collection or sequence.
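To make the each plus Enumerable point concrete, here is a minimal sketch. The Countdown class is hypothetical and exists only for this illustration.

```ruby
# Hypothetical collection class, used only for illustration.
class Countdown
  include Enumerable

  def initialize(start)
    @start = start
  end

  # Enumerable builds all of its methods on top of each.
  def each
    return enum_for(:each) unless block_given?

    @start.downto(1) { |n| yield n }
  end
end

countdown = Countdown.new(5)

evens   = countdown.select(&:even?)      # filtering
doubled = countdown.map { |n| n * 2 }    # transforming
pairs   = countdown.each_slice(2).to_a   # grouping
```

The class itself only defines each; select, map, each_slice, include? and the rest come from the mixin.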
Ruby is different to C++. In C++ you use a for loop to loop through anything, but in Ruby you'll find you're usually looping through an enumerable object, so it's more common to do something like:
monkeys.each do |monkey|
  monkey.say 'ow!'
end
Don't try to look for too much equivalence between the two languages - they're built for different things. Obviously there are a lot of equivalent things, but you can't learn Ruby by producing a chart that shows C++ code on one side and the Ruby equivalent on the other. Try to learn the idiomatic way of doing things and you'll find it much easier.
If you want ways of looping through enumerable objects, check out the methods in the Enumerable module: all? any? collect detect each_cons each_slice each_with_index entries enum_cons enum_slice enum_with_index find find_all grep include? inject map max member? min partition reject select sort sort_by to_a to_set zip. With most of these methods you'd use a for loop to do the equivalent thing in C++.
You can do:
(0..j).each do |i|
  puts i
end
I am not terribly familiar with C++, but AFAICS, the equivalent Ruby code to the loop you posted is simply:
i, j = 0, 0
Which shows once again the expressive power Ruby has. Anybody can figure out what this does, even if he has never seen Ruby before, while the equivalent C++ takes quite a while to figure out.

How is Array#sort in Ruby so quick?

C# has BitArray, C has bit fields.. I couldn't find an equivalent in the Ruby core. Google showed me a BitField class that Peter Cooper has written for the same.
I've been reading Jon Bentley's Programming Pearls and while trying out one of the first examples, which deals with a BitMap sort - I needed a type that is an array of bits. I used Peter's class
class BitMapSort
  def self.sort(list, value_range_max)
    bitmap = BitField.new(value_range_max)
    list.each { |x| bitmap[x] = 1 }
    sorted_list = []
    0.upto(value_range_max - 1) { |number|
      sorted_list << number if bitmap[number] == 1
    }
    sorted_list
  end
end
Running this on a set of 1M unique numbers in the range [0, 10,000,000), produced some interesting results,
                          user     system      total        real
bitmap               11.078000   0.015000  11.093000 ( 11.156250)
ruby-system-sort      0.531000   0.000000   0.531000 (  0.531250)
quick-sort           21.562000   0.000000  21.562000 ( 21.625000)
Benchmark.bm(20) { |x|
  x.report("bitmap") { ret = BitMapSort.sort(series, 10_000_000) }
  x.report("ruby-system-sort") { ret = series.sort }
  x.report("quick-sort") { ret = QuickSort.sort(series, 0, series.length - 1) }
}
How is Ruby's default sort 22x faster than 1M BitField.set + 1 loop over a 10M bit vector? Is there a more efficient bit field / array in Ruby? How does Ruby's default sort achieve this level of performance? Is it jumping into C to get this done?
Array#sort is implemented in C, see rb_ary_sort in array.c
It also has some checks to compare Fixnums so sorting an array of Integers doesn't even need method lookups.
How does Ruby's default sort achieve this level of performance.. is it jumping into C to get this done?
All of the core classes and methods in ruby's default implementation are implemented in C.
The reason why it's so much faster is probably that the sort runs as compiled C code inside the Ruby interpreter rather than as interpreted Ruby.
I think the real problem here is you're doing 10M comparisons, 10M array fetches, 10M of a lot of things, whereas a properly optimized sort routine is doing far fewer operations since it is working with a fixed set of 1M items.
Basic operations such as sort are highly optimized in the Ruby VM and are difficult to beat with a pure-Ruby alternative.
Jumping into C is exactly correct. Array and Hash both have C implementations of many methods in order to improve performance. Integer and float literals also have some tricky code optimizations. When you convert them to a bitmap you lose this optimization as well.
With a compiled language like C or Java it really makes sense to look for tricky optimization patterns. With interpreted languages the cost of interpreting each command makes this counterproductive.
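On the question of a more compact bit array in pure Ruby, one possible approach is to pack the bits into a binary String and use getbyte and setbyte. The sketch below is only an illustration of that idea: ByteBitmap and bitmap_sort are hypothetical names, this is not Peter Cooper's BitField, and nothing here suggests it would beat Array#sort, since every bit operation is still an interpreted method call.

```ruby
# A minimal bitmap backed by a binary String; ByteBitmap is a hypothetical name.
class ByteBitmap
  def initialize(size)
    # One byte holds eight bits, so round the byte count up.
    @bytes = ("\0" * ((size + 7) / 8)).b
  end

  # Turn on the bit at the given index.
  def set(index)
    byte = index >> 3
    @bytes.setbyte(byte, @bytes.getbyte(byte) | (1 << (index & 7)))
  end

  # True when the bit at the given index is on.
  def set?(index)
    @bytes.getbyte(index >> 3)[index & 7] == 1
  end
end

# Bitmap sort for unique non-negative integers below value_range_max.
def bitmap_sort(list, value_range_max)
  bitmap = ByteBitmap.new(value_range_max)
  list.each { |x| bitmap.set(x) }
  (0...value_range_max).select { |number| bitmap.set?(number) }
end

sorted = bitmap_sort([7, 3, 9, 0, 4], 10)
```

The scan over the whole value range remains, so the point made above about doing 10M operations to sort 1M items applies to this version as well.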
