Performance using the Ruby 'include?' method

I would like to know how much the include? method can affect performance when I have something like this:
array = [<array_values>]            # see the sizes listed below
(0..<n_iterations>).each { |value|  # see the sizes listed below
  array.include?(value)
}
Here <array_values> contains 10, 100, or 1,000 elements and <n_iterations> is 10, 100, or 1,000.

Use a Set (or equivalently a Hash) instead of an array, so that include? is O(1) on average instead of O(n).
Or, if you have multiple include? checks to do, you can use array intersection (&) or subtraction (-), which build a temporary Hash internally to do the operation efficiently.
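For instance, here is a minimal sketch of the difference (the sizes and values are only illustrative):
require 'set'

array = (1..1_000).to_a
set   = array.to_set           # one-off O(n) conversion

array.include?(500)            # O(n): scans the array element by element
set.include?(500)              # O(1) on average: a hash lookup

# Intersection builds a temporary hash internally, so one pass
# answers many membership questions at once:
candidates = [3, 42, 5_000]
candidates & array             # => [3, 42]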

I think ruby-prof might be a good place to start. However, that performance data won't be useful without something else to compare this to. As in, "is the performance of this method better or worse than [some other method]?"
Also, note that once n_iterations grows larger than the size of array, code like the following will probably perform better, simply because it makes far fewer #include? calls:
array.each do |value|
  (0..<n_iterations>).include?(value)
end
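If you do reach for ruby-prof, a minimal profiling harness might look like the sketch below (this uses the classic RubyProf.profile / FlatPrinter API; check the README of the version you have installed):
require 'ruby-prof'

array = (1..1_000).to_a

result = RubyProf.profile do
  (0..1_000).each { |value| array.include?(value) }
end

# Print a flat report showing where the time went
RubyProf::FlatPrinter.new(result).print($stdout)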


Why must we call to_a on an enumerator object?

The chaining of each_slice and to_a confuses me. I know that each_slice is a member of Enumerable and therefore can be called on enumerable objects like arrays, and chars does return an array of characters.
I also know that each_slice will slice the array in groups of n elements, which is 2 in the below example. And if a block is not given to each_slice, then it returns an Enumerator object.
'186A08'.chars.each_slice(2).to_a
But why must we call to_a on the enumerator object if each_slice has already grouped the array into n-element slices? Why doesn't Ruby just evaluate what the enumerator object is (which is a collection of those slices)?
The purpose of enumerators is lazy evaluation. When you call each_slice, you get back an enumerator object. This object does not calculate the entire grouped array up front. Instead, it calculates each “slice” as it is needed. This helps save on memory, and also allows you quite a bit of flexibility in your code.
This stack overflow post has a lot of information in it that you’ll find useful:
What is the purpose of the Enumerator class in Ruby
To give you a cut-and-dried answer to your question "Why must I call to_a when...": because, so far, nothing has been evaluated. It hasn't yet looped through the array at all. So far it has just defined an object that says that when it does go through the array, you're going to want elements two at a time. You then have the freedom either to force it to do the calculation on all elements in the enumerable (by calling to_a), or to use next or each to walk through and stop partway (maybe calculate only half of them, as opposed to calculating all of them and throwing the second half away).
It’s similar to how the Range class does not build up the list of elements in the range. (1..100000) doesn’t make an array of 100000 numbers, but instead defines an object with a min and max and certain operations can be performed on that. For example (1..100000).cover?(5) doesn’t build a massive array to see if that number is in there, but instead just sees if 5 is greater than or equal to 1 and less than or equal to 100000.
The purpose of this all is performance and flexibility.
It may be worth considering whether your implementation actually needs to make an array up front, or whether you can actually keep your RAM consumption down a bit by iterating over the enumerator. (If your real world scenario is as simple as you described, an enumerator won’t help much, but if the array actually is large, an enumerator could help you a lot).
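To make the laziness concrete, here is a small sketch of pulling slices on demand instead of materialising them all at once:
enum = '186A08'.chars.each_slice(2)   # an Enumerator; no slices computed yet

enum.next   # => ["1", "8"]  first slice, computed on demand
enum.next   # => ["6", "A"]  second slice; we could stop here and never pay for the rest

enum.to_a   # => [["1", "8"], ["6", "A"], ["0", "8"]]  forces full evaluation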

Bubble Sort method

I am just learning Ruby, and KevinC's response (in this link) makes sense to me with one exception: I don't understand why the code is wrapped in the outer arr.each do |i| ... end block. That part seems redundant to me, as the inner while loop is already hitting each of the positions. Can someone explain?
The inner loop finds a bubble and carries it up; if it finds another, lighter bubble, it switches them around and carries the lighter one. So you need several passes through the array to find all the bubbles and carry them to the correct place, since you can't float several bubbles at the same time.
EDIT:
The each is really misused in KevinC's code, since it is not used for its normal purpose: yielding elements of the collection. Instead of arr.each, it would be better to use arr.size.times - as it would be more informative to the reader. Redefining the i within the block is adding insult to injury. While none of this will cause the code to be wrong as such, it is misleading.
The other problem with the code is the fact that it does not provide the early termination condition (swapped in most other answers on that question). In theory, bubble sort could find the array sorted in the first pass; the other size - 1 passes would then be unnecessary. KevinC's code would still dry-hump the already sorted array, never realising it is done.
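For reference, here is a sketch of a bubble sort that keeps such a swapped flag (my own illustration, not KevinC's code):
def bubble_sort(arr)
  loop do
    swapped = false
    (arr.size - 1).times do |i|
      if arr[i] > arr[i + 1]
        arr[i], arr[i + 1] = arr[i + 1], arr[i]
        swapped = true
      end
    end
    break unless swapped   # a pass with no swaps means the array is sorted
  end
  arr
end

bubble_sort([5, 1, 4, 2, 8])   # => [1, 2, 4, 5, 8]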
As for rewriting it into block-less code, that is certainly possible, but you need to understand that block syntax is very idiomatic in Ruby, and non-block loops are almost unheard of in the Ruby world. While the language has for, it is pretty much never used in practice. But...
arr.each do |i|
  ...
end
is equivalent to
for i in arr
  ...
end
which is, again, at least for the array case, equivalent to
index = 0
while index < arr.size
  i = arr[index]
  ...
  index += 1
end

Performance of Sets vs. Arrays in Ruby

In Ruby, I am building a method which constructs and returns a (probably large) array which should contain no duplicate elements. Would I get better performance by using a set and then converting that to an array? Or would it be better to just call .uniq on the array I am using before I return it? Or what about using & to append items to the array instead of +=? And if I do use a set, would not having a <=> method on the object I am putting into the set have an effect on performance? (If you're not sure, do you know of a way to test this?)
The real answer is: write the most readable and maintainable code, and optimize it only after you've shown it is a bottleneck. If you can find an algorithm that runs in linear time, you won't have to optimize it. Here it's easy to find...
Not quite sure which methods you are suggesting, but using my fruity gem:
require 'fruity'
require 'set'
enum = 1000.times
compare do
  uniq { enum.each_with_object([]){|x, array| array << x}.uniq }
  set  { enum.each_with_object(Set[]){|x, set| set << x}.to_a }
  join { enum.inject([]){|array, x| array | [x]} }
end
# set is faster than uniq by 10.0% ± 1.0%
# uniq is faster than join by 394x ± 10.0
Clearly, it makes no sense building intermediate arrays like in the third method. Otherwise, it's not going to make a big difference since you will be in O(n); that's the main thing.
BTW, Set, uniq, and Array#| all use eql? and hash on your objects, not <=>. These need to be defined in a sane manner, because the default is that objects are never eql? unless they are the very same object (same object_id); see this question.
Have you tried using the Benchmark library? Tests are usually very easy to construct and will properly reflect how it works in your particular version of Ruby.
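For example, a bare-bones Benchmark script along these lines would do (the sizes are arbitrary and kept small so the slow union case finishes quickly):
require 'benchmark'
require 'set'

items = Array.new(2_000) { rand(1_000) }

Benchmark.bm(8) do |x|
  x.report('uniq')  { items.uniq }
  x.report('set')   { Set.new(items).to_a }
  x.report('union') { items.inject([]) { |acc, v| acc | [v] } }
end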

Various ways of creating Objects in Ruby

Are these methods of creating an empty Ruby Hash different? If so, how?
myHash = Hash.new
myHash = {}
I'd just like a solid understanding of memory management in Ruby.
There are many ways you can create a Hash object in Ruby, though the end result is the same sort of object:
hash = { }
hash = Hash.new
hash = Hash[]
hash = some_object.to_h
hash = YAML.load("--- {}\n\n")
As far as memory considerations go, an empty Hash is significantly smaller than one holding even a single value. Arrays tend to be smaller than Hashes at small sizes, but Hashes are more efficient for lookups as the collection grows.
In practice, though, the important thing to remember in Ruby is that every time you create an object it costs you something, even if it's only an infinitesimal amount. These little hits add up if you're needlessly creating billions of objects.
Generally you should avoid creating structures that will not be used, and instead create them on demand if that wouldn't complicate things needlessly. For example, a typical pattern is:
def cache
  @cache ||= { }
end
Until this method is called, the @cache Hash is never created. The memory savings in this instance are nearly insignificant, but if that method were loading a large configuration file or importing several hundred MB of data from a database, you can imagine the savings would be significant in those cases where that data is never exercised.
The two methods are exactly equivalent.
As is mentioned above, the two are operationally equivalent. If you're referring to the standard MRI / YARV, perhaps this thread would help: http://www.ruby-forum.com/topic/215163#new.
With the Hash.new syntax you can specify what to do when a key is absent from the hash (the default behaviour); with the {} syntax that takes an extra step afterwards.
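A small sketch of that difference, using the stock Hash API:
counts = Hash.new(0)                    # default value given at construction
counts[:missing]                        # => 0

words = Hash.new { |h, k| h[k] = [] }   # default computed (and stored) per key
words[:ruby] << 'gem'
words[:ruby]                            # => ["gem"]

plain = {}                              # literal syntax: set the default afterwards
plain[:missing]                         # => nil
plain.default = 0
plain[:missing]                         # => 0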
my_hash = Hash[]
is another way of creating a hash; the Hash.[] method takes an even number of arguments, alternating keys and values:
my_hash = Hash[:a, 1, :b, 2]
This has nothing to do with memory management.

What's the most efficient way to partition a large hash into N smaller hashes in Ruby?

The Problem
I'm working on a problem that involves sharding. As part of the problem I need to find the fastest way to partition a large Ruby hash (> 200,000 entries) into two or more pieces.
Are there any non O(n) approaches?
Is there a non-Ruby i.e. C/C++ implementation?
Please don't reply with examples using the trivial approach of converting the hash to an array and rebuilding N distinct hashes.
My concern is that Ruby is too slow to do this kind of work.
The initial approach
This was the first solution I tried. What was appealing about it was:
it didn't need to loop slavishly across the hash
it didn't need to manage a counter to allocate the members evenly among the shards
it's short and neat looking
OK, it isn't better than O(n), but it relies on methods in the standard library, which I figured would be faster than writing my own Ruby code.
pivot = s.size / 2
slices = s.each_slice(pivot)
s1 = Hash[*slices.entries[0].flatten]
s2 = Hash[*slices.entries[1].flatten]
A better solution
Mark and Mike were kind enough to suggest approaches. I have to admit that Mark's approach felt wrong - it did exactly what I didn't want - it looped over all of the members of the hash and evaluated a conditional as it went - but since he'd taken the time to do the evaluation, I figured that I should try a similar approach and benchmark it. This is my adapted version of his approach (my keys aren't numbers, so I can't take his approach verbatim):
def split_shard(s)
  shard1 = {}
  shard2 = {}

  t = Benchmark.measure do
    n = 0
    pivot = s.size / 2

    s.each_pair do |k, v|
      if n < pivot
        shard1[k] = v
      else
        shard2[k] = v
      end
      n += 1
    end
  end

  $b += t.real
  $e += s.size

  return shard1, shard2
end
The results
In both cases, a large number of hashes are split into shards. The total number of elements across all of the hashes in the test data set was 1,680,324.
My initial solution - which I assumed had to be faster because it uses methods from the standard library and minimises the amount of Ruby code (no loop, no conditional) - runs in just over 9s
Mark's approach runs in just over 5s
That's a significant win
Take away
Don't be fooled by 'intuition' - measure the performance of competing algorithms.
Don't worry about Ruby's performance as a language - my initial concern was that doing ten million of these operations could take a significant amount of time in Ruby, but it doesn't really.
Thanks to Mark and Mike who both get points from me for their help.
Thanks!
I don't see how you can achieve this using an unmodified "vanilla" Hash - I'd expect that you'd need to get into the internals in order to make partitioning into some kind of bulk memory-copying operation. How good is your C?
I'd be more inclined to look into partitioning instead of creating a Hash in the first place, especially if the only reason for the 200K-item Hash existing in the first place is to be subdivided.
EDIT: After thinking about it at the gym...
The problem with finding some existing solution is that someone else needs to have (a) experienced the pain, (b) had the technical ability to address it and (c) felt community-friendly enough to have released it into the wild. Oh, and for your OS platform.
What about using a B-Tree instead of a Hash? Hold your data sorted by key and it can be traversed by memcpy(). B-Tree retrieval is O(log N), which isn't much of a hit against Hash most of the time.
I found something here which might help, and I'd expect there'd only be a little duck-typing wrapper needed to make it quack like a Hash.
Still gonna need those C/C++ skills, though. (Mine are hopelessly rusty).
This probably isn't fast enough for your needs (which sound like they'll require an extension in C), but perhaps you could use Hash#select?
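For what it's worth, a partition along those lines might look like this (stand-in data and a hypothetical numeric cutoff; note that Hash#select returns a Hash on Ruby 1.9+, whereas 1.8 returned an array of pairs):
s = (1..200_000).each_with_object({}) { |i, h| h[i] = i }   # stand-in data
pivot_key = 100_001                                         # hypothetical cutoff

shard1 = s.select { |k, _v| k < pivot_key }
shard2 = s.reject { |k, _v| k < pivot_key }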
I agree with Mike Woodhouse's idea. Is it possible for you to construct your shards at the same place where the original 200k-item hash is being constructed? If the items are coming out of a database, you could split your query into multiple disjoint queries, based either on some aspect of the key or by repeatedly using something like LIMIT 10000 to grab a chunk at a time.
Additional
Hi Chris, I just compared your approach to using Hash#select:
require 'benchmark'

s = {}
1.upto(200_000) { |i| s[i] = i }

Benchmark.bm do |x|
  x.report {
    pivot = s.size / 2
    slices = s.each_slice(pivot)
    s1 = Hash[*slices.entries[0].flatten]
    s2 = Hash[*slices.entries[1].flatten]
  }
  x.report {
    s1 = {}
    s2 = {}
    s.each_pair do |k, v|
      if k < 100_001
        s1[k] = v
      else
        s2[k] = v
      end
    end
  }
end
It looks like Hash#select is much faster, even though it goes through the entire large hash for each one of the sub-hashes:
# ruby test.rb
user system total real
0.560000 0.010000 0.570000 ( 0.571401)
0.320000 0.000000 0.320000 ( 0.323099)
Hope this helps.
