Array difference by explicitly specified method or block - ruby

If I have Arrays a and b, the expression a-b returns an Array with all those elements in a which are not in b. "Not in" means inequality (!=) here.
In my case, both arrays only contain elements of the same type (or, from the ducktyping perspective, only elements which understand a "equality" method f).
Is there an easy way to specify this f as a criterion of equality, in a similar way to how I can provide my own comparator when doing sort? Currently, I have implemented this explicitly:
# Get the difference a-b, based on 'f':
a.select { |ael| b.all? {|bel| ael.f != bel.f} }
This works, but I wonder if there is an easier way.
UPDATE: From the comments to this question, I get the impression, that a concrete example would be appreciated. So, here we go:
class Dummy; end
# Create an Array of Dummy objects.
a = Array.new(99) { Dummy.new }
# Pick some of them at random
b = Array.new(10) { a.sample }
# Now I want to get those elements from a, which are not in b.
diff = a.select { |ael| b.all? {|bel| ael.object_id != bel.object_id} }
Of course, in this case I could also have said !ael.eql?(bel), but in my general solution, this is not the case.

The "normal" object equality for e.g. Hashes and set operations on Arrays (such as the - operation) uses the output of the Object#hash method of the contained objects along with the semantics of the a.eql?(b) comparison.
This can be used to improve performance. Ruby assumes here that two objects which are eql? return the same value from their respective hash methods (and consequently assumes that two objects returning different hash values are not eql?).
For a normal a - b operation, this can thus be used to first calculate the hash value of each object once and then only compare those values. This is quite fast.
Now, if you have a custom equality, your best bet would be to override the object's hash method (together with eql?, which Ruby consults alongside hash) so that they return suitable values for those semantics.
A common approach is to build an array containing all data that makes up the object's identity and get its hash, e.g.
class MyObject
#...
attr_accessor :foo, :bar
def hash
[self.class, foo, bar].hash
end
end
In your object's hash method, you would then include all data that is currently considered by your f comparison method. Instead of actually using f then, you are using the default semantics of all Ruby objects and again can achieve quick set operations with your objects.
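As a minimal sketch of the approach described above, the following assumes a hypothetical value class whose identity consists of foo and bar. It overrides eql? alongside hash, since Array#- relies on both; the constructor and the sample data are illustrative assumptions, not part of the original answer.

```ruby
# Hypothetical class: two instances count as equal when foo and bar match.
class MyObject
  attr_reader :foo, :bar

  def initialize(foo, bar)
    @foo = foo
    @bar = bar
  end

  # Objects that are eql? must return the same hash value.
  def hash
    [self.class, foo, bar].hash
  end

  # Array#- and Hash lookups use eql? together with hash.
  def eql?(other)
    other.class == self.class && other.foo == foo && other.bar == bar
  end
  alias == eql?
end

a = [MyObject.new(1, 2), MyObject.new(3, 4)]
b = [MyObject.new(1, 2)]

# The built-in difference now follows the custom equality.
diff = a - b
puts diff.map { |o| [o.foo, o.bar].inspect }
```

With both methods in place, the ordinary a - b is enough and no explicit loop over b is needed.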
If however this is not feasible (e.g. because you need different equality semantics based on use-case), you could emulate what ruby does on your own.
With your f method, you could then perform your set operation as follows:
def f_difference(a, b)
a_map = a.each_with_object({}) do |a_el, hash|
hash[a_el.f] = a_el
end
b.each do |b_el|
a_map.delete b_el.f
end
a_map.values
end
With this approach, you only need to calculate the f value of each of your objects once. We first build a hash map with all f values and elements from a, and then remove every entry whose f value matches that of an element from b. The remaining values are the result.
This approach saves you from having to loop over b for each object in a, which can be slow if you have a lot of objects. If however you only have a few objects in each of your arrays, your original approach should already be fine.
Let's have a look at a benchmark where I use the standard hash method in place of your custom f to have a comparable result.
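The same idea can be sketched with a block in place of a fixed f method, which is closer to how sort_by takes its criterion. The name difference_by and the sample data are assumptions made for illustration. Unlike the hash-map version above, this variant keeps elements of a that share the same key with each other, because it only collects the keys of b.

```ruby
require 'set'

# Hypothetical helper: removes from a every element whose key
# (as computed by the block) also occurs among the keys of b.
def difference_by(a, b)
  excluded = b.map { |el| yield(el) }.to_set
  a.reject { |el| excluded.include?(yield(el)) }
end

words = %w[apple Banana cherry APPLE]
result = difference_by(words, %w[BANANA]) { |w| w.downcase }
p result
```

The key is computed once per element of a and once per element of b, so the cost grows with the sum of the two sizes rather than their product.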
require 'benchmark/ips'
def question_diff(a, b)
a.select { |ael| b.all? {|bel| ael.hash != bel.hash} }
end
def answer_diff(a, b)
a_map = a.each_with_object({}) do |a_el, hash|
hash[a_el.hash] = a_el
end
b.each do |b_el|
a_map.delete b_el.hash
end
a_map.values
end
A = Array.new(100) { rand(10_000) }
B = Array.new(10) { A.sample }
Benchmark.ips do |x|
x.report("question") { question_diff(A, B) }
x.report("answer") { answer_diff(A, B) }
x.compare!
end
With Ruby 2.7.1, I get the following result on my machine, showing that the original approach from the question is about 5.9 times slower than the optimized version from my answer:
Warming up --------------------------------------
question 1.304k i/100ms
answer 7.504k i/100ms
Calculating -------------------------------------
question 12.779k (± 2.0%) i/s - 63.896k in 5.002006s
answer 74.898k (± 3.3%) i/s - 375.200k in 5.015239s
Comparison:
answer: 74898.0 i/s
question: 12779.3 i/s - 5.86x (± 0.00) slower

Related

Why does RuboCop suggest replacing .times.map with Array.new?

RuboCop suggests:
Use Array.new with a block instead of .times.map.
In the docs for the cop:
This cop checks for .times.map calls. In most cases such calls can be replaced with an explicit array creation.
Examples:
# bad
9.times.map do |i|
i.to_s
end
# good
Array.new(9) do |i|
i.to_s
end
I know it can be replaced, but I feel 9.times.map is closer to English grammar, and it's easier to understand what the code does.
Why should it be replaced?
The latter is more performant; here is an explanation: Pull request where this cop was added
It checks for calls like this:
9.times.map { |i| f(i) }
9.times.collect(&foo)
and suggests using this instead:
Array.new(9) { |i| f(i) }
Array.new(9, &foo)
The new code has approximately the same size, but uses fewer method
calls, consumes less memory, works a tiny bit faster and in my opinion
is more readable.
I've seen many occurrences of times.{map,collect} in different
well-known projects: Rails, GitLab, Rubocop and several closed-source
apps.
Benchmarks:
Benchmark.ips do |x|
x.report('times.map') { 5.times.map{} }
x.report('Array.new') { Array.new(5){} }
x.compare!
end
__END__
Calculating -------------------------------------
times.map 21.188k i/100ms
Array.new 30.449k i/100ms
-------------------------------------------------
times.map 311.613k (± 3.5%) i/s - 1.568M
Array.new 590.374k (± 1.2%) i/s - 2.954M
Comparison:
Array.new: 590373.6 i/s
times.map: 311612.8 i/s - 1.89x slower
I'm not sure now that Lint is the correct namespace for the cop. Let
me know if I should move it to Performance.
Also I didn't implement autocorrection because it can potentially
break existing code, e.g. if someone has Fixnum#times method redefined
to do something fancy. Applying autocorrection would break their code.
If you feel it is more readable, go with it.
This is a performance rule and most codepaths in your application are probably not performance critical. Personally, I am always open to favor readability over premature optimization.
That said
100.times.map { ... }
times creates an Enumerator object
map enumerates over that object without being able to optimize: for example, the size of the array is not known up front, so it might have to reallocate more space dynamically, and it has to enumerate over the values by calling each, since that is how Enumerable#map is implemented
Whereas
Array.new(100) { ... }
new allocates an array of size N
And then uses a native loop to fill in the values
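To make the comparison concrete, here is a small sketch showing that both forms produce the same array, with the intermediate Enumerator being the visible difference. The variable names are illustrative only.

```ruby
# Both forms pass the index 0...n to the block and collect the results.
with_times = 5.times.map { |i| i * i }
with_new   = Array.new(5) { |i| i * i }

p with_times
p with_new

# Without a block, times returns an Enumerator, which map then walks.
p 5.times.class
```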
When you need to map the result of a block invoked a fixed amount of times, you have an option between:
Array.new(n) { ... }
and:
n.times.map { ... }
The latter one is about 60% slower for n = 10, which goes down to around 40% for n > 1_000.
(Benchmark chart not reproduced here; it used a logarithmic scale.)

How to override Enumerables sort method

I'm trying to create a method that uses the functionality of Enumerable's sort method.
Imagine I have this data
data = [{project: 'proj', version: '1.1'}, {project: 'proj2', version: '1.11'}, {project: 'proj3', version: '1.2'}]
I want to be able to call the method like this:
data.natural_sort{|a,b| b[:version] <=> a[:version] }
The actual call that happens would achieve something like this:
data.sort{|a,b| MyModule.naturalize_str(b[:version]) <=> MyModule.naturalize_str(a[:version]) }
Here's my current broken code:
Enumerable.module_eval do
def natural_sort(&block)
if !block_given?
block = Proc.new{|a,b| Rearmed.naturalize_str(a[:version]) <=> Rearmed.naturalize_str(b[:version])}
end
sort do |a,b|
a = Rearmed.naturalize_str(a)
b = Rearmed.naturalize_str(b)
block.call(a,b)
end
end
end
It throws an error because a and b are the hashes instead of the versions I wanted.
You're working at odds with yourself here. In your natural_sort block you're expecting hash objects, yet within the implementation you've explicitly cast a and b to be strings.
In Ruby there are two ways to sort: the sort method with a,b pairs, and the sort_by method, which uses an intermediate sort form to do the comparisons. The sort_by approach is usually significantly faster since it applies the transform to each object once, while the sort method does it each time a comparison is done.
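The difference in how often the transform runs can be observed with a counter. This is only a sketch: the key lambda below is a simple stand-in that splits a version string into integers, not the naturalize_str method from the question.

```ruby
calls = 0
# Stand-in transform (an assumption for this example).
key = lambda do |version|
  calls += 1
  version.split('.').map(&:to_i)
end

versions = %w[1.11 1.2 1.1 1.10]

sorted_by = versions.sort_by { |v| key.call(v) }
sort_by_calls = calls

calls = 0
sorted = versions.sort { |a, b| key.call(a) <=> key.call(b) }
sort_calls = calls

p sorted_by
puts "sort_by ran the transform #{sort_by_calls} times, sort ran it #{sort_calls} times"
```

sort_by calls the transform exactly once per element, whereas sort calls it twice for every comparison it makes.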
Here's a rewrite:
def natural_sort_by(&block)
if (block_given?)
sort_by do |o|
Rearmed.naturalize_str(yield(o))
end
else
sort_by do |o|
Rearmed.naturalize_str(o)
end
end
end
Then you can call it this way:
data.natural_sort_by { |o| o[:version] }

Ruby Hash destructive vs. non-destructive method

Could not find a previous post that answers my question...I'm learning how to use destructive vs. non-destructive methods in Ruby. I found an answer to the exercise I'm working on (destructively adding a number to hash values), but I want to be clear on why some earlier solutions of mine did not work. Here's the answer that works:
def modify_a_hash(the_hash, number_to_add_to_each_value)
the_hash.each { |k, v| the_hash[k] = v + number_to_add_to_each_value}
end
These two solutions come back as non-destructive, and since they all use "each" I cannot figure out why. To make something destructive, is it the equals sign above that does the trick?
def modify_a_hash(the_hash, number_to_add_to_each_value)
the_hash.each_value { |v| v + number_to_add_to_each_value}
end
def modify_a_hash(the_hash, number_to_add_to_each_value)
the_hash.each { |k, v| v + number_to_add_to_each_value}
end
The terms "destructive" and "non-destructive" are a bit misleading here. Better is to use the conventional "in-place modification" vs. "returns a copy" terminology.
Generally methods that modify in-place have ! at the end of their name to serve as a warning, like gsub! for String. Some methods that pre-date this convention do not have them, like push for Array.
The = performs an assignment within the loop. Your other examples don't actually do anything useful since each returns the original object being iterated over regardless of any results produced.
If you wanted to return a copy you'd do this:
def modify_a_hash(the_hash, number_to_add)
Hash[
the_hash.collect do |k, v|
[ k, v + number_to_add ]
end
]
end
That would return a copy. The inner operation collect transforms key-value pairs into new key-value pairs with the adjustment applied. No = is required since there's no assignment.
The outer method Hash[] transforms those key-value pairs into a proper Hash object. This is then returned and is independent of the original.
Generally a non-destructive or "return a copy" method needs to create a new, independent version of the thing it's manipulating for the purpose of storing the results. This applies to String, Array, Hash, or any other class or container you might be working with.
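The contrast can be sketched with Hash#transform_values and Hash#transform_values!, assuming Ruby 2.4 or newer. These are a different technique from the Hash[] and collect combination shown above, and the sample hash is made up for illustration.

```ruby
prices = { apple: 1, pear: 2 }

# each discards the block's result and returns the receiver itself.
returned = prices.each { |_name, price| price + 10 }
untouched = (prices == { apple: 1, pear: 2 })

# Returns a new, independent hash; prices is still unchanged here.
copy = prices.transform_values { |price| price + 10 }

# Modifies prices in place, as the trailing ! suggests.
prices.transform_values! { |price| price + 10 }

p copy
p prices
```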
Maybe this slightly different example will be helpful.
We have a hash:
2.0.0-p481 :014 > hash
=> {1=>"ann", 2=>"mary", 3=>"silvia"}
Then we iterate over it and change all the letters to uppercase:
2.0.0-p481 :015 > hash.each { |key, value| value.upcase! }
=> {1=>"ANN", 2=>"MARY", 3=>"SILVIA"}
The original hash has changed because we used the upcase! method.
Compare this to the method without the ! sign, which doesn't modify the hash values:
2.0.0-p481 :017 > hash.each { |key, value| value.downcase }
=> {1=>"ANN", 2=>"MARY", 3=>"SILVIA"}

Some simple Ruby questions - iterators, blocks, and symbols

My background is in PHP and C#, but I'd really like to learn RoR. To that end, I've started reading the official documentation. I have some questions about some code examples.
The first is with iterators:
class Array
def inject(n)
each { |value| n = yield(n, value) }
n
end
def sum
inject(0) { |n, value| n + value }
end
def product
inject(1) { |n, value| n * value }
end
end
I understand that yield means "execute the associated block here." What's throwing me is the |value| n = part of the each. The other blocks make more sense to me as they seem to mimic C# style lambdas:
public int sum(int n, int value)
{
return Inject((n, value) => n + value);
}
But the first example is confusing to me.
The other is with symbols. When would I want to use them? And why can't I do something like:
class Example
attr_reader @member
# more code
end
In the inject or reduce method, n represents an accumulated value; this means the result of every iteration is accumulated in the n variable. This could be, as is in your example, the sum or product of the elements in the array.
yield returns the result of the block, which is stored in n and used in the next iterations. This is what makes the result "cumulative."
a = [ 1, 2, 3 ]
a.sum # inject(0) { |n, v| n + v }
# n == 0; n = 0 + 1
# n == 1; n = 1 + 2
# n == 3; n = 3 + 3
=> 6
To compute the sum you could also have written a.reduce :+. This works for any binary operation. If your method is named symbol, writing a.reduce :symbol is the same as writing a.reduce { |n, v| n.symbol v }.
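A short sketch of the equivalent spellings, using the built-in inject and reduce rather than the redefined version from the question; the array is an arbitrary example.

```ruby
numbers = [1, 2, 3, 4]

# Explicit block with an initial accumulator value.
sum_block = numbers.inject(0) { |n, value| n + value }

# Symbol form: the first element serves as the initial value.
sum_symbol = numbers.reduce(:+)

# Initial value combined with a symbol.
product = numbers.reduce(1, :*)

p sum_block, sum_symbol, product
```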
attr and company are actually methods. Under the hood, they dynamically define the methods for you. They use the symbol you passed to work out the names of the instance variable and the methods. :member results in the @member instance variable and the member and member= methods.
The reason you can't write attr_reader @member is that @member isn't an object in itself, nor can it be converted to a symbol; it actually tells Ruby to fetch the value of the instance variable @member of the self object, which, at class scope, is the class itself.
To illustrate:
class Example
@member = :member
attr_accessor @member
end
e = Example.new
e.member = :value
e.member
=> :value
Remember that accessing unset instance variables yields nil, and since the attr method family accepts only symbols, you get: TypeError: nil is not a symbol.
Regarding Symbol usage, you can sort of use them like strings. They make excellent hash keys because equal symbols always refer to the same object, unlike strings.
:a.object_id == :a.object_id
=> true
'a'.object_id == 'a'.object_id
=> false
They're also commonly used to refer to method names, and can actually be converted to Procs, which can be passed to methods. This is what allows us to write things like array.map &:to_s.
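As a brief sketch of what the & shorthand does, the following compares an explicit block with the symbol form and calls the converted Proc directly. The word list is just sample data.

```ruby
words = %w[ann mary silvia]

with_block  = words.map { |w| w.upcase }
with_symbol = words.map(&:upcase)

# & calls to_proc on the symbol; the resulting Proc sends
# the named method to whatever argument it receives.
upcase = :upcase.to_proc
single = upcase.call('ann')

p with_block, with_symbol, single
```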
Check out this article for more interpretations of the symbol.
For the definition of inject, you're basically setting up chained blocks. Specifically, the variable n in {|value| n = yield(n, value)} is essentially an accumulator for the block passed to inject. So, for example, for the definition of product, inject(1) {|n, value| n * value}, let's assume you have an array my_array = [1, 2, 3, 4]. When you call my_array.product, you start by calling inject with n = 1. each yields to the block defined in inject, which in turn yields to the block passed to inject itself with n (1) and the first value in the array (1 as well, in this case). This block, {|n, value| n * value}, returns 1 (that is, 1 * 1), which is stored in inject's n variable. Next, 2 is yielded from each, and the block defined in inject yields as yield(1, 2), which returns 2 and assigns it to n. Next, 3 is yielded from each, the block yields the values (2, 3) and returns 6, which is stored in n for the next value, and so forth. Essentially, tracking the running value independently of the calculation being performed in the specialised routines (sum and product) allows for generalization. Without that, you'd have to declare e.g.
def sum
n = 0
each {|val| n += val}
n # each returns the array itself, so return the accumulator explicitly
end
def product
n = 1
each {|val| n *= val}
n
end
which is annoyingly repetitive.
For your second question, attr_reader and its family are themselves methods that define the appropriate accessor routines for you (much as you could yourself with define_method), in a process called metaprogramming; they are not language statements, but just plain old methods. These methods expect to be passed a symbol (or a string) that gives the name of the accessors you're creating. You could, in theory, use instance variables such as @member here, though it would be the value to which @member points that would be passed in and used to name the accessor. For an example of how these are implemented, this page shows some examples of attr_* methods.
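To illustrate the idea of a method that generates accessors, here is a hypothetical my_attr_reader written with define_method. It is a sketch for explanation only and not how Ruby's own attr_reader is implemented; the module and method names are made up.

```ruby
# Hypothetical module providing a hand-written attr_reader lookalike.
module MyAccessors
  def my_attr_reader(*names)
    names.each do |name|
      # Only the name is needed to define the method; the value is
      # looked up later, whenever the generated getter is called.
      define_method(name) { instance_variable_get("@#{name}") }
    end
  end
end

class Example
  extend MyAccessors
  my_attr_reader :member

  def initialize(member)
    @member = member
  end
end

p Example.new(42).member
```

This also shows why a symbol is the natural argument: at class scope there is no instance yet, so only a name can be passed.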
def inject(accumulator)
each { |value| accumulator = yield(accumulator, value) }
accumulator
end
This is just yielding the current value of accumulator and the array item to inject's block and then storing the result back into accumulator again.
class Example
attr_reader @member
end
attr_reader is just a method whose argument is the name of the accessor you want to set up. So, in a contrived way you could do
class Example
@ivar_name = 'foo'
attr_reader @ivar_name
end
to create a getter method called foo.
Your confusion with the first example may be due to your reading |value| n as a single expression, but it isn't.
This reformatted version might be clearer to you:
def inject(n)
each do |value|
n = yield(n, value)
end
return n
end
value is an element in the array, and it is yielded with n to whatever block is passed to inject, the result of which is set to n. If that's not clear, read up on the each method, which takes a block and yields each item in the array to it. Then it should be clearer how the accumulation works.
attr_reader is less weird when you consider that it is a method for generating accessor methods. It's not an accessor in itself. It doesn't need to deal with the @member variable's value, just its name. :member is just the interned version of the string 'member', which is the name of the variable.
You can think of symbols as lighter weight strings, with the additional bonus that every equal label is the same object - :foo.object_id == :foo.object_id, whereas 'foo'.object_id != 'foo'.object_id, because each 'foo' is a new object. You can try that for yourself in irb. Think of them as labels, or primitive strings. They're surprisingly useful and come up a lot, e.g. for metaprogramming or as keys in hashes. As pointed out elsewhere, calling object.send :foo is the same as calling object.foo
It's probably worth reading some early chapters from the 'pickaxe' book to learn some more ruby, it will help you understand and appreciate the extra stuff rails adds.
First you need to understand where to use symbols and where not to.
A symbol is mainly used to represent or name something, e.g. :name or :age. We are not going to perform any operations on it.
Strings are used for data processing, e.g. a = 'name'. Here I am going to use the variable a for further string operations in Ruby.
Moreover, a symbol is more memory efficient than a string, and it is immutable. That's why Ruby developers often prefer symbols to strings.
You can even use inject method to calculate sum as (1..5).to_a.inject(:+)

Difference between map and collect in Ruby?

I have Googled this and got patchy / contradictory opinions - is there actually any difference between doing a map and doing a collect on an array in Ruby/Rails?
The docs don't seem to suggest any, but are there perhaps differences in method or performance?
There's no difference; in fact map is implemented in C as rb_ary_collect and enum_collect (i.e. there is a difference between map on an array and on any other enum, but no difference between map and collect).
Why do both map and collect exist in Ruby? The map function has many naming conventions in different languages. Wikipedia provides an overview:
The map function originated in functional programming languages but is today supported (or may be defined) in many procedural, object oriented, and multi-paradigm languages as well: In C++'s Standard Template Library, it is called transform, in C# (3.0)'s LINQ library, it is provided as an extension method called Select. Map is also a frequently used operation in high level languages such as Perl, Python and Ruby; the operation is called map in all three of these languages. A collect alias for map is also provided in Ruby (from Smalltalk) [emphasis mine]. Common Lisp provides a family of map-like functions; the one corresponding to the behavior described here is called mapcar (-car indicating access using the CAR operation).
Ruby provides an alias for programmers from the Smalltalk world to feel more at home.
Why is there a different implementation for arrays and enums? An enum is a generalized iteration structure, which means that there is no way in which Ruby can predict what the next element can be (you can define infinite enums, see Prime for an example). Therefore it must call a function to get each successive element (typically this will be the each method).
Arrays are the most common collection, so it is reasonable to optimize their performance. Since Ruby knows a lot about how arrays work, it doesn't have to call each but can use simple pointer manipulation instead, which is significantly faster.
Similar optimizations exist for a number of Array methods like zip or count.
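The generic Enumerable path can be sketched with a small custom collection. Countdown is a hypothetical class made up for this example; it only defines each, and both map and collect come from Enumerable and call it to obtain the elements.

```ruby
# Hypothetical collection that only knows how to iterate.
class Countdown
  include Enumerable

  def initialize(from)
    @from = from
  end

  # Enumerable builds map, collect, select and so on from this method.
  def each
    return enum_for(:each) unless block_given?

    @from.downto(1) { |i| yield i }
  end
end

mapped    = Countdown.new(3).map { |i| i * 2 }
collected = Countdown.new(3).collect { |i| i * 2 }

p mapped
p collected
```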
I've been told they are the same.
Actually they are documented in the same place under ruby-doc.org:
http://www.ruby-doc.org/core/classes/Array.html#M000249
ary.collect {|item| block } → new_ary
ary.map {|item| block } → new_ary
ary.collect → an_enumerator
ary.map → an_enumerator
Invokes block once for each element of self.
Creates a new array containing the values returned by the block.
See also Enumerable#collect.
If no block is given, an enumerator is returned instead.
a = [ "a", "b", "c", "d" ]
a.collect {|x| x + "!" } #=> ["a!", "b!", "c!", "d!"]
a #=> ["a", "b", "c", "d"]
The collect and collect! methods are aliases to map and map!, so they can be used interchangeably. Here is an easy way to confirm that:
Array.instance_method(:map) == Array.instance_method(:collect)
=> true
I did a benchmark test to try and answer this question, then found this post so here are my findings (which differ slightly from the other answers)
Here is the benchmark code:
require 'benchmark'
h = { abc: 'hello', 'another_key' => 123, 4567 => 'third' }
a = 1..10 # note: a Range rather than an Array, so the "array" reports below go through Enumerable#map
many = 500_000
Benchmark.bm do |b|
GC.start
b.report("hash keys collect") do
many.times do
h.keys.collect(&:to_s)
end
end
GC.start
b.report("hash keys map") do
many.times do
h.keys.map(&:to_s)
end
end
GC.start
b.report("array collect") do
many.times do
a.collect(&:to_s)
end
end
GC.start
b.report("array map") do
many.times do
a.map(&:to_s)
end
end
end
And the results I got were:
user system total real
hash keys collect 0.540000 0.000000 0.540000 ( 0.570994)
hash keys map 0.500000 0.010000 0.510000 ( 0.517126)
array collect 1.670000 0.020000 1.690000 ( 1.731233)
array map 1.680000 0.020000 1.700000 ( 1.744398)
Perhaps an alias isn't free?
Ruby aliases the method Array#map to Array#collect; they can be used interchangeably. (Ruby Monk)
In other words, same source code :
static VALUE
rb_ary_collect(VALUE ary)
{
long i;
VALUE collect;
RETURN_SIZED_ENUMERATOR(ary, 0, 0, ary_enum_length);
collect = rb_ary_new2(RARRAY_LEN(ary));
for (i = 0; i < RARRAY_LEN(ary); i++) {
rb_ary_push(collect, rb_yield(RARRAY_AREF(ary, i)));
}
return collect;
}
http://ruby-doc.org/core-2.2.0/Array.html#method-i-map
#collect is actually an alias for #map. That means the two methods can be used interchangeably, and effect the same behavior.

Resources