Why must we call to_a on an enumerator object? - ruby

The chaining of each_slice and to_a confuses me. I know that each_slice is a member of Enumerable and therefore can be called on enumerable objects like arrays, and chars does return an array of characters.
I also know that each_slice will slice the array in groups of n elements, which is 2 in the below example. And if a block is not given to each_slice, then it returns an Enumerator object.
'186A08'.chars.each_slice(2).to_a
But why must we call to_a on the enumerator object if each_slice has already grouped the array by n elements? Why doesn't ruby just evaluate what the enumerator object is (which is a collection of n elements)?

The purpose of enumerators is lazy evaluation. When you call each_slice, you get back an enumerator object. This object does not calculate the entire grouped array up front. Instead, it calculates each “slice” as it is needed. This helps save on memory, and also allows you quite a bit of flexibility in your code.
This Stack Overflow post has a lot of information that you’ll find useful:
What is the purpose of the Enumerator class in Ruby
To give you a cut-and-dried answer to your question “Why must I call to_a when...”: each_slice hasn’t grouped anything yet. It hasn’t looped through the array at all. So far it has only defined an object that says that, when it does go through the array, you want the elements two at a time. You are then free either to force the calculation on all elements in the enumerable (by calling to_a), or to use next or each to go through the slices and stop partway (for example, calculating only the first half rather than calculating all of them and throwing the second half away).
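A minimal sketch of that point, using the string from the question (the variable names are only illustrative):

```ruby
# each_slice without a block returns an Enumerator; nothing is grouped yet.
slices = '186A08'.chars.each_slice(2)
slices.class          # Enumerator

# Pull slices one at a time and stop whenever you like.
first  = slices.next  # ["1", "8"]
second = slices.next  # ["6", "A"]

# Or force every slice to be computed and collected into an array.
all = slices.to_a     # [["1", "8"], ["6", "A"], ["0", "8"]]
```

Note that to_a iterates from the start, regardless of how far next has advanced.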
It’s similar to how the Range class does not build up the list of elements in the range. (1..100000) doesn’t make an array of 100000 numbers, but instead defines an object with a min and max and certain operations can be performed on that. For example (1..100000).cover?(5) doesn’t build a massive array to see if that number is in there, but instead just sees if 5 is greater than or equal to 1 and less than or equal to 100000.
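The Range comparison can be tried directly; this is just a small sketch with an illustrative variable name:

```ruby
big = (1..100_000)

# cover? only compares the argument against the two endpoints.
inside  = big.cover?(5)        # true
outside = big.cover?(100_001)  # false

# An array of all the numbers exists only if you explicitly ask for one.
numbers = big.to_a
numbers.length                 # 100000
```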
The purpose of this all is performance and flexibility.
It may be worth considering whether your implementation actually needs to make an array up front, or whether you can actually keep your RAM consumption down a bit by iterating over the enumerator. (If your real world scenario is as simple as you described, an enumerator won’t help much, but if the array actually is large, an enumerator could help you a lot).

Related

What are .each iterator fetch order guarantees?

I am really baffled by something, as it led to hours of head scratching. I have the following segment of code:
objectA.arrayA.each do |p|
  # do stuff with p
end
I thought this was fine: based on this question, I assumed that because I am using an array, the order would be stable. Unfortunately that was not the case; the order in which the each iterator returned the elements was not always the same. After hours of looking for the issue in other blocks, swapping the above code for this for loop solved the problem:
for i in 0...objectA.arrayA.length
  # do stuff with the array element
end
Does anyone have any idea when the ordering of each is guaranteed?
The docs for Enumerable state
The Enumerable mixin provides collection classes with several
traversal and searching methods, and with the ability to sort. The
class must provide a method each, which yields successive members of
the collection. If Enumerable#max, #min, or #sort is used, the objects
in the collection must also implement a meaningful <=> operator, as
these methods rely on an ordering between members of the collection.
So Array#each must also yield successive members to meet this contract.
If an implementation doesn't enforce this, it would be a bug in the implementation.
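A quick check of that contract, using a small illustrative array: each and an index-based for loop visit the elements in the same order.

```ruby
letters = %w[a b c d]

# Collect the elements in the order each yields them.
via_each = []
letters.each { |letter| via_each << letter }

# Collect them again by walking the indexes explicitly.
via_index = []
for i in 0...letters.length
  via_index << letters[i]
end

same_order = (via_each == via_index)  # true
```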

Enumerator::Lazy and Garbage Collection

I am using Ruby's built in CSV parser against large files.
My approach is to separate the parsing from the rest of the logic. To achieve this I am creating an array of hashes. I also want to take advantage of Ruby's Enumerator::Lazy to avoid loading the entire file into memory.
My question is: when I'm actually iterating through the array of hashes, does the garbage collector clean things up as I go, or will it only clean up when the entire array can be cleaned up, essentially still keeping everything in memory?
I'm not asking if it will clean each element as I finish with it, only if it will clean it before the entire enum is actually evaluated.
When you iterate over a plain old array, the garbage collector has no chance to do anything: the array keeps a reference to every element for as long as the array itself is alive.
You can help the garbage collector by writing nil into the array position after you no longer need the element, so that the object in this position may now be free for collection.
When you use a lazy enumerator correctly, you do not iterate over an array of hashes. Instead you enumerate over the hashes, handling one after the other, and each one is read on demand.
So you have the chance to use much less memory (depending on your further processing, and provided it does not hold on to the objects anyway).
The structure may look like this:
enum = Enumerator.new do |yielder|
  csv.read(...) do
    ...
    yielder.yield hash
  end
end
enum.lazy.map { |hash| do_something(hash); nil }.count
You also need to make sure that you do not build the array again in the last step of the chain.
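A runnable sketch of that structure, under some assumptions: a StringIO stands in for the large file, lines are split by hand to keep the example self-contained (a real program would use the CSV library here), and the column names are made up.

```ruby
require 'stringio'

# Stand-in for a large CSV file (assumption for illustration only).
io = StringIO.new("name,qty\napple,3\npear,5\nplum,7\n")

rows = Enumerator.new do |yielder|
  io.rewind
  headers = io.gets.chomp.split(',')
  io.each_line do |line|
    # Build one hash per row and hand it to the consumer immediately.
    yielder << headers.zip(line.chomp.split(',')).to_h
  end
end

# The chain handles one row hash at a time and never collects them in an array.
total = rows.lazy.map { |row| row['qty'].to_i }.sum  # 15
```

Finishing the chain with something like to_a would rebuild the full array and lose the memory benefit.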

how does this Ruby code work? (hash) (Learnrubythehardway)

I know I will look like a total noob, but there's something I can't wrap my head around. Let me emphasize that I DID google this, but I didn't find what I was looking for.
I'm going through the learnrubythehardway course, and for ex39 this is one of the functions we have defined:
def Dict.hash_key(aDict, key)
  return key.hash % aDict.length
end
The author gives this explanation:
hash_key
This deceptively simple function is the core of how a hash works. What it does is uses the built-in Ruby hash function to convert a
string to a number. Ruby uses this function for its own hash data
structure, and I'm just reusing it. You should fire up a Ruby console
to see how it works. Once I have a number for the key, I then use the
% (modulus) operator and the aDict.length to get a bucket where this
key can go. As you should know, the % (modulus) operator will divide
any number and give me the remainder. I can also use this as a way of
limiting giant numbers to a fixed smaller set of other numbers. If you
don't get this then use Ruby to explore it
I like this course, but the above paragraph was no help.
Ok, you call the function passing it two arguments (aDict is an array) and it returns something.
(My questions are not totally independent of one another.)
What and how does it do that? (ok, it returns a bucket index, but how do we "get there"?)
What does the key.hash do/what is it?
How does using the % help me get what I need? (What is the use of "modding" the key.hash by the aDict.length?)
"Use Ruby to explore it." - ok, but my question No.2. kinda already suggests that I wouldn't know how to go about doing that.
Thanks in advance.
key.hash calls the key's hash method (Object#hash, which String overrides), not to be confused with the Hash class.
For a string, hash converts it into a number consistently (the same string will always result in the same number, in the same running instance of Ruby).
pry(main)> "abc".hash
=> -1672853150
So now we have a number, but it's way too large for the number of buckets in our Dict structure, which defaults to 256 buckets. So we take it modulo the bucket count to get a number within our bucket range.
pry(main)> "abc".hash % 256
=> 98
This essentially allows us to translate Dict["abc"] into aDict[98].
RE: This example in particular
I'm going to change the order of things in a way that I hope makes more sense:
#2. You can think of a hash as a sort of 'fingerprint' of something. The .hash method will create a (generally) unique output for any given input.
#3. In this case, we know that the hash is a number, so we take the modulo of the generated number by the backing array's length in order to find a (hopefully empty) index that is within our storage's bounds.
#1. That's how. A hashing algorithm will return the same output for any given input. The modulo takes this output and turns it into something we can actually use in an array to find something reliably.
#4. Call hash on something. Call it on a string and then modulo it by the length of an array. Try again on another string. Do that again, and use your result to assign something to that array. Do it again to see that the hash and modulo thing will find that value again.
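The experiment described in #4 might look like this. It is a sketch with a hypothetical 256-slot array standing in for the Dict's buckets; the actual index differs between Ruby processes, so no specific number is shown.

```ruby
buckets = Array.new(256)   # hypothetical backing array with 256 empty slots

# Hash the key, then fold the result into the range 0...256.
index = 'abc'.hash % buckets.length
buckets[index] = 'value for abc'

# Hashing the same key again lands on the same slot (within one Ruby process).
found = buckets['abc'.hash % buckets.length]   # "value for abc"
```

In Ruby, % with a positive divisor never returns a negative number, so the index is valid even when hash is negative.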
Further Notes:
By itself, the modulo function is not a good way to pick unique indexes for keys. This example is the first step, but especially in a small array, there is still a relatively large chance for the hashes of different keys to modulo into the same number. That's called a collision, and handling those seems to be outside the scope of this question.
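To see a collision concretely, here is a sketch that swaps in a deliberately weak toy hash (toy_hash is hypothetical and only for illustration; Ruby's own String#hash is far better distributed and changes between processes):

```ruby
# Toy hash: just adds up the bytes of the key, so anagrams always collide.
def toy_hash(key)
  key.bytes.sum
end

bucket_count = 8
index_ab = toy_hash('ab') % bucket_count  # 3
index_ba = toy_hash('ba') % bucket_count  # 3

collision = (index_ab == index_ba)        # true: two keys, one bucket
```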

Is the .each iterator in ruby guaranteed to give the same order on the same elements every time?

I'm doing something like this with a list 'a':
a.each_with_index do |outer, i|
  a.each_with_index do |inner, j|
    if j > i
      # do some operation with outer and inner
    end
  end
end
If the iterator is not going to use the same order, this won't work. I don't care what the order actually is; I just need the two .each_with_index iterators to use the same order.
I would assume that it would be a property of an array that it has a fixed order and I'm just being paranoid that the iterator wouldn't use that order...
This depends on the specific Enumerable object you are operating on.
Arrays, for example, will always return elements in the same order. But other enumerable objects are not guaranteed to behave this way. A good example of this is the core Hash in Ruby 1.8.7. That is why many frameworks (most notably ActiveSupport) implement an OrderedHash.
One interesting side note: even Hash will return objects in the same order if the hash has not changed between calls to each. While many objects behave this way, relying on this subtlety is probably not a great idea.
So, no. The generic each will not always return objects in the same order.
P.S. Ruby 1.9's hashes are now actually ordered http://www.igvita.com/2009/02/04/ruby-19-internals-ordered-hash
I've not looked at your actual code but here is your answer taken from the Ruby API docs:
Arrays are ordered, integer-indexed collections of any object.
So yes, you are being paranoid but surely that's a good thing when you're developing?
Array by definition is an ordered list of elements. So you should have no problems with that.
It depends on the specific Enumerable. Certainly an Array will always iterate in the obvious order.
It would be quite lunatic fringe for someone to implement an each method that would traverse the same collection in different ways, but the only actual restriction for such a "feature" would be in the documentation for the class that mixes in Enumerable. Well, in that and the sanity of the implementors.
I can almost imagine some sort of cryptographic API that deliberately traversed a collection in an unpredictable way.
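As a sketch of what such an oddball could look like, here is a hypothetical class (ShuffledBag is made up for this example) whose each deliberately uses a random order, next to an Array, which does not:

```ruby
# Hypothetical collection whose each traverses in a random order every time.
class ShuffledBag
  include Enumerable

  def initialize(items)
    @items = items
  end

  def each(&block)
    return enum_for(:each) unless block
    @items.shuffle.each(&block)
    self
  end
end

bag = ShuffledBag.new([1, 2, 3, 4])
bag.to_a   # some permutation of [1, 2, 3, 4], possibly different on each call
bag.sort   # [1, 2, 3, 4]

# An Array, by contrast, yields index 0, 1, 2, ... every time.
pairs = [1, 2, 3, 4].each_with_index.to_a  # [[1, 0], [2, 1], [3, 2], [4, 3]]
```

Enumerable only requires that each yields the members; the order is whatever the including class decides.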

If ruby encourages duck typing so much, why don't we have Hash.count instead of Hash.length?

This is something that really confuses me. Time and time again I run into methods on Ruby's native data types that do (essentially) the same thing and yet have different names. If duck typing is so strongly encouraged by Ruby and the Ruby community, why aren't these methods named consistently across types?
You seem to imply that Hash does not have a length method and/or that other enumerables don't have a count method. That is not true.
count is a method defined in the Enumerable module and thus available on all enumerables. It differs from size and length in the following ways:
It (optionally) takes a block specifying which kind of elements to count.
It's available on all enumerables - not just those that keep track of their size - however it has a runtime in O(n) for those that don't (and always when given a block of course).
length and size (which are synonyms) are methods defined on all enumerable classes that keep track of their size (including Hash). They differ from count in that they always return the length in O(1) time and don't take a block.
In summary: You can call length or size on any object that keeps track of its size and you can call count on any enumerable. So duck typing is not hampered in any way.
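A short sketch of the difference, with made-up data:

```ruby
stock = { apples: 3, pears: 0, plums: 7 }

stock.length   # 3
stock.size     # 3
stock.count    # 3

# Only count accepts a block that selects which elements to count.
in_stock = stock.count { |_name, qty| qty > 0 }  # 2

# An enumerator that does not know its size still supports count, by iterating.
evens = Enumerator.new do |y|
  y << 2
  y << 4
  y << 6
end
evens.size     # nil
evens.count    # 3
```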
