Fastest way to get maximum value from an exclusive Range in ruby - ruby

Ok, so say you have a really big Range in ruby. I want to find a way to get the max value in the Range.
The Range is exclusive (defined with three dots), meaning that it does not include the end object in its results. It could be made up of Integer, String, Time, or really any object that responds to #<=> and #succ (which are the only requirements for the start/end object in a Range).
Here's an example of an exclusive range:
past = Time.local(2010, 1, 1, 0, 0, 0)
now = Time.now
range = past...now
range.include?(now) # => false
Now I know I could just do something like this to get the max value:
range.max # => returns 1 second before "now" using Enumerable#max
But this will take a non-trivial amount of time to execute. I also know that I could subtract 1 second from whatever the end object is. However, the object may be something other than Time, and it may not even support #-. I would prefer to find an efficient general solution, but I am willing to combine special case code with a fallback to a general solution (more on that later).
As mentioned above, using Range#last won't work either, because it's an exclusive range and does not include the last value in its results.
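To make the point about Range#last concrete, here is a minimal sketch. It uses a small Integer range purely for illustration (an assumption; the question's range holds Time objects):

```ruby
range = 1...10

# Without an argument, Range#last returns the end object itself,
# even though an exclusive range does not contain it.
end_object = range.last          # => 10

# With an argument, Range#last enumerates the range and returns
# the final n elements as an Array.
last_element = range.last(1)     # => [9]

included = range.include?(10)    # => false
```

Note that range.last(1) still has to walk the whole range, so it is no faster than range.max.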
The fastest approach I could think of was this:
max = nil
range.each { |value| max = value }
# max now contains nil if the range is empty, or the max value
This is similar to what Enumerable#max does (which Range inherits), except that it exploits the fact that each value is going to be greater than the previous one, so we can skip using #<=> to compare each value with the previous one (the way Range#max does), saving a tiny bit of time.
The other approach I was thinking about was to have special case code for common ruby types like Integer, String, Time, Date, DateTime, and then use the above code as a fallback. It'd be a bit ugly, but probably much more efficient when those object types are encountered because I could use subtraction from Range#last to get the max value without any iterating.
Can anyone think of a more efficient/faster approach than this?

The simplest solution that I can think of, which will work for inclusive as well as exclusive ranges:
range.max
Some other possible solutions:
range.entries.last
range.entries[-1]
These solutions are all O(n), and will be very slow for large ranges. The problem in principle is that range values in Ruby are enumerated using the succ method iteratively on all values, starting at the beginning. The elements do not have to implement a method to return the previous value (i.e. pred).
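As a rough illustration of why enumeration is forward-only, here is a sketch with a made-up element class (Step is hypothetical, not part of Ruby) that implements only #<=> and #succ. That is enough for Range to iterate it, but nothing lets Range step backwards from the end:

```ruby
# Hypothetical element type: comparable and able to produce its successor,
# but with no way to ask for a predecessor.
class Step
  include Comparable
  attr_reader :n

  def initialize(n)
    @n = n
  end

  def succ
    Step.new(n + 1)
  end

  def <=>(other)
    n <=> other.n
  end
end

range = Step.new(1)...Step.new(4)

values  = range.map(&:n)   # walks 1, 2, 3 by calling succ repeatedly
max_val = range.max.n      # also enumerates every element to find the last one
```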
The fastest method would be to find the predecessor of the last item (an O(1) solution):
range.exclude_end? ? range.last.pred : range.last
This works only for ranges that have elements which implement pred. Later versions of Ruby implement pred for integers. You have to add the method yourself if it does not exist (essentially equivalent to special case code you suggested, but slightly simpler to implement).
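One way to combine the O(1) path with a general fallback is sketched below. The helper name fast_range_max is made up for this example, and the empty-range handling is an assumption about the desired behaviour (it returns nil, as Range#max does):

```ruby
# Hypothetical helper: returns the largest element of a range.
# Uses #pred on the end object when available (O(1)); otherwise falls
# back to enumerating the range (O(n)).
def fast_range_max(range)
  first = range.begin
  last  = range.end

  cmp = first <=> last
  # Empty or incomparable ranges have no maximum.
  return nil if cmp.nil? || cmp > 0 || (cmp.zero? && range.exclude_end?)

  return last unless range.exclude_end?
  return last.pred if last.respond_to?(:pred)

  # General fallback: each value is greater than the previous one,
  # so the last value yielded is the maximum.
  max = nil
  range.each { |value| max = value }
  max
end

fast_range_max(1...1_000_000)  # uses Integer#pred, no iteration
fast_range_max('a'...'e')      # String has no #pred, so this enumerates
```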
Some quick benchmarking shows that this last method is the fastest by many orders of magnitude for large ranges (in this case range = 1...1000000), because it is O(1):
                                            user     system      total        real
r.entries.last                         11.760000   0.880000  12.640000 ( 12.963178)
r.entries[-1]                          11.650000   0.800000  12.450000 ( 12.627440)
last = nil; r.each { |v| last = v }    20.750000   0.020000  20.770000 ( 20.910416)
r.max                                  17.590000   0.010000  17.600000 ( 17.633006)
r.exclude_end? ? r.last.pred : r.last   0.000000   0.000000   0.000000 (  0.000062)
Benchmark code is here.
In the comments it is suggested to use range.last - (range.exclude_end? ? 1 : 0). It does work for dates without additional methods, but will never work for non-numeric ranges. String#- does not exist and makes no sense with integer arguments. String#pred, however, can be implemented.

I'm not sure about the speed (and initial tests don't seem incredibly fast), but the following might do what you need:
past = Time.local(2010, 1, 1, 0, 0, 0)
now = Time.now
range = past...now
range.to_a[-1]
Very basic testing (counting in my head) showed that it took about 4 seconds while the method you provided took about 5-6. Hope this helps.
Edit 1: Removed second solution as it was totally wrong.

I can't think of any way to achieve this that doesn't involve enumerating the range, at least unless, as already mentioned, you have other information about how the range will be constructed and can therefore infer the desired value without enumeration. Of all the suggestions, I'd go with #max, since it seems to be the most expressive.
require 'benchmark'
N = 20
Benchmark.bm(30) do |r|
  past, now = Time.local(2010, 2, 1, 0, 0, 0), Time.now
  @range = past...now
  r.report("range.max") do
    N.times { last_in_range = @range.max }
  end
  r.report("explicit enumeration") do
    N.times { @range.each { |value| last_in_range = value } }
  end
  r.report("range.entries.last") do
    N.times { last_in_range = @range.entries.last }
  end
  r.report("range.to_a[-1]") do
    N.times { last_in_range = @range.to_a[-1] }
  end
end
                                    user     system      total        real
range.max                      49.406000   1.515000  50.921000 ( 50.985000)
explicit enumeration           52.250000   1.719000  53.969000 ( 54.156000)
range.entries.last             53.422000   4.844000  58.266000 ( 58.390000)
range.to_a[-1]                 49.187000   5.234000  54.421000 ( 54.500000)
I notice that the 3rd and 4th option have significantly increased system time. I expect that's related to the explicit creation of an array, which seems like a good reason to avoid them, even if they're not obviously more expensive in elapsed time.

Related

How fast is Ruby's method for converting a range to an array?

(1..999).to_a
Is this method O(n)? I'm wondering if the conversion involves an implicit iteration so Ruby can write the values one-by-one into consecutive memory addresses.
The method is actually slightly worse than O(n). Not only does it do a naive iteration, but it doesn't check ahead of time what the size will be, so it has to repeatedly allocate more memory as it iterates. I've opened an issue for that aspect and it's been discussed a few times on the mailing list (and briefly added to ruby-core). The problem is that, like almost anything in Ruby, Range can be opened up and messed with, so Ruby can't really optimize the method. It can't even count on Range#size returning the correct result. Worse, some enumerables even have their size method delegate to to_a.
In general, it shouldn't be necessary to make this conversion, but if you really need array methods, you might be able to use Array#fill instead, which lets you populate a (potentially pre-allocated) array using values derived from its indices.
Range.instance_methods(false).include? :to_a
# false
Range doesn't have to_a, it inherits it from the Enumerable mix-in, so it builds the array by pushing each value on one at a time. That seems like it would be really inefficient, but I'll let the benchmark speak for itself:
require 'benchmark'
size = 100_000_000
Benchmark.bmbm do |r|
r.report("range") { (0...size).to_a }
r.report("fill") { a = Array.new(size); a.fill { |i| i } }
r.report("array") { Array.new(size) { |i| i } }
end
# Rehearsal -----------------------------------------
# range   4.530000   0.180000   4.710000 (  4.716628)
# fill    5.810000   0.150000   5.960000 (  5.966710)
# array   7.630000   0.250000   7.880000 (  7.879940)
# ------------------------------- total: 18.550000sec
#
#             user     system      total        real
# range   4.540000   0.120000   4.660000 (  4.660249)
# fill    5.980000   0.110000   6.090000 (  6.089962)
# array   7.880000   0.110000   7.990000 (  7.985818)
Isn't that weird? It's actually the fastest by a significant margin. And manually filling an Array is somehow faster than just using the constructor.
But as is usually the case with Ruby, don't worry too much about this. For reasonably sized ranges the performance difference will be negligible.

More concise way of setting a counter, incrementing, and returning?

I'm working on a problem in which I want to compare two strings of equal length, char by char. For each index where the chars differ, I need to increment the counter by 1. Right now I have this:
def compute(strand1, strand2)
  raise ArgumentError, "Sequences are different lengths" unless strand1.length == strand2.length
  mutations = 0
  strand1.chars.each_with_index { |nucleotide, index| mutations += 1 if nucleotide != strand2[index] }
  mutations
end
I'm new to this language but this feels very non-Ruby to me. Is there a one-liner that can consolidate the process of setting a counter, incrementing it, and then returning it?
I was thinking along the lines of selecting all chars that don't match up and then getting the size of the resulting array. However, from what I could tell, there is no select_with_index method. I was also looking into inject but can't seem to figure out how I would apply it in this scenario.
Probably the simplest answer is to just count the differences:
strand1.chars.zip(strand2.chars).count{|a, b| a != b}
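Dropped into the method from the question, that might look like the sketch below. The length check is kept from the original; the sample strands in the usage line are made up for illustration:

```ruby
def compute(strand1, strand2)
  raise ArgumentError, "Sequences are different lengths" unless strand1.length == strand2.length

  # Pair up the characters position by position and count the mismatches.
  strand1.chars.zip(strand2.chars).count { |a, b| a != b }
end

distance = compute("GAGCCTACTAACGGGAT", "CATCGTAATGACGGCCT")  # => 7
```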
Here are a number of one-liners:
strand1.size.times.select { |index| strand1[index] != strand2[index] }.size
This is not the best one since it generates an intermediary array up to O(n) in size.
strand1.size.times.inject(0) { |sum, index| sum += 1 if strand1[index] != strand2[index]; sum }
Does not take extra memory but is a bit hard to read.
strand1.chars.each_with_index.count { |x, index| x != strand2[index] }
I'd go with that. Kudos to user12341234 for mentioning count, which this is built upon.
Update
Benchmarking on my machine gives different results from what @CarySwoveland gets:
                  user     system      total        real
mikej         4.080000   0.320000   4.400000 (  4.408245)
user12341234  3.790000   0.210000   4.000000 (  4.003349)
Nik           2.830000   0.170000   3.000000 (  3.008533)
                  user     system      total        real
mikej         3.590000   0.020000   3.610000 (  3.618018)
user12341234  4.040000   0.140000   4.180000 (  4.183357)
lazy user     4.250000   0.240000   4.490000 (  4.497161)
Nik           2.790000   0.010000   2.800000 (  2.808378)
This is not to point out my code as having better performance, but to mention that environment plays a great role in choosing any particular implementation approach.
I'm running this on native 3.13.0-24-generic #47-Ubuntu SMP x64 on 12-Core i7-3930K with enough RAM.
This is just an extended comment, so no upvotes please (downvotes are OK).
Gentlemen: start your engines!
Test problem
def random_string(n, selection)
n.times.each_with_object('') { |_,s| s << selection.sample }
end
n = 5_000_000
a = ('a'..'e').to_a
s1 = random_string(n,a)
s2 = random_string(n,a)
Benchmark code
require 'benchmark'
Benchmark.bm(12) do |x|
  x.report('mikej') do
    s1.chars.each_with_index.inject(0) {|c,(n,i)| n==s2[i] ? c : c+1}
  end
  x.report('user12341234') do
    s1.chars.zip(s2.chars).count{|a,b| a != b }
  end
  x.report('lazy user') do
    s1.chars.zip(s2.chars).lazy.count{|a,b| a != b }
  end
  x.report('Nik') do
    s1.chars.each_with_index.count { |x,i| x != s2[i] }
  end
end
Results
mikej         6.220000   0.430000   6.650000 (  6.651575)
user12341234  6.600000   0.900000   7.500000 (  7.504499)
lazy user     7.460000   7.800000  15.260000 ( 15.255265)
Nik           6.140000   3.080000   9.220000 (  9.225023)
                  user     system      total        real
mikej         6.190000   0.470000   6.660000 (  6.662569)
user12341234  6.720000   0.500000   7.220000 (  7.223716)
lazy user     7.250000   7.110000  14.360000 ( 14.356845)
Nik           5.690000   0.920000   6.610000 (  6.621889)
[Edit: I added a lazy version of 'user12341234'. As is generally the case with lazy versions of enumerators, there is a tradeoff between the amount of memory used and execution time. I did several runs. Here I report the results of two typical ones. There was very little variability for 'mikej' and 'user12341234', somewhat more for lazy user and quite a bit for Nik.]
inject would indeed be a way to do this with a one-liner, so you are on the right track:
strand1.chars.each_with_index.inject(0) { |count, (nucleotide, index)|
nucleotide == strand2[index] ? count : count + 1 }
We start with an initial value of 0. If the two letters are the same, the block returns the accumulator unchanged; if they differ, it returns the accumulator plus 1.
Also, notice here that each_with_index is being called without a block. When called like this it returns an enumerator object. This lets us use one of the other enumerator methods (in this case inject) with the pairs of value and index returned by each_with_index, allowing the functionality of each_with_index and inject to be combined.
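A small sketch of that enumerator behaviour, using a made-up three-letter array and comparison string (both are assumptions for illustration only):

```ruby
letters = %w[a b c]

# Called without a block, each_with_index returns an Enumerator
# instead of iterating immediately.
enum = letters.each_with_index
enum_class = enum.class   # => Enumerator
pairs = enum.to_a         # => [["a", 0], ["b", 1], ["c", 2]]

# Chaining inject onto it yields [value, index] pairs, which the
# block destructures with (letter, index).
other = "abx"
mismatches = enum.inject(0) do |count, (letter, index)|
  letter == other[index] ? count : count + 1
end
# mismatches => 1
```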
Edit: looking at it, user12341234's solution is a nice way of doing it. Good use of zip! I'd probably go with that.

Finding the minimum of mapped data

Given an array of complex objects, an algorithm for mapping each to Comparable values, and the desire to find the minimum such value, is there a built-in library method that will do this in a single pass?
Effective but not perfectly efficient solutions:
# Iterates through the array twice
min = objects.map{ |o| make_number o }.min
# Calls make_number one time more than is necessary
min = make_number( objects.min_by{ |o| make_number o } )
Efficient, but verbose solution:
min = nil
objects.each{ |o| n=make_number(o); min=n if !min || n<min }
No, no such library method already exists.
I don't really see an issue with either of your two original solutions. The enumerator code is written in C and is generally very fast. You can always just benchmark it and see what is fastest for your specific dataset and code (try https://github.com/acangiano/ruby-benchmark-suite)
However, if you really do want one pass, you can simplify your #each version by using #reduce:
min = objects.reduce(Float::INFINITY){ |min, o|
  n = make_number(o)
  min > n ? n : min
}
If your objects are already numbers of some form, you can omit the Float::INFINITY. Otherwise, in order to make sure we are only comparing number values, you will need to add it.
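Another option, assuming Ruby 2.0 or later where Enumerable#lazy is available, is to map lazily so that no intermediate array is built and the collection is walked once. This is only a sketch: make_number below is a stand-in for whatever mapping you actually use, and the sample objects are made up:

```ruby
# Hypothetical mapping for illustration: the "number" of an object
# is the length of its string form.
def make_number(obj)
  obj.to_s.length
end

objects = %w[kiwi fig banana]

# lazy.map defers the mapping; min then pulls each mapped value once,
# calling make_number exactly once per object.
min = objects.lazy.map { |o| make_number(o) }.min   # => 3

# Unlike reduce(Float::INFINITY), an empty collection gives nil.
empty_min = [].lazy.map { |o| make_number(o) }.min  # => nil
```

Whether this is actually faster than map followed by min depends on the data; lazy enumerators carry their own overhead, so it is worth benchmarking.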

Count, size, length...too many choices in Ruby?

I can't seem to find a definitive answer on this and I want to make sure I understand this to the "n'th level" :-)
a = { "a" => "Hello", "b" => "World" }
a.count # 2
a.size # 2
a.length # 2
a = [ 10, 20 ]
a.count # 2
a.size # 2
a.length # 2
So which to use? If I want to know if a has more than one element then it doesn't seem to matter but I want to make sure I understand the real difference. This applies to arrays too. I get the same results.
Also, I realize that count/size/length have different meanings with ActiveRecord. I'm mostly interested in pure Ruby (1.9.2) right now, but if anyone wants to chime in on the difference AR makes, that would be appreciated as well.
Thanks!
For arrays and hashes size is an alias for length. They are synonyms and do exactly the same thing.
count is more versatile - it can take an element or predicate and count only those items that match.
> [1,2,3].count{|x| x > 2 }
=> 1
In the case where you don't provide a parameter to count it has basically the same effect as calling length. There can be a performance difference though.
We can see from the source code for Array that they do almost exactly the same thing. Here is the C code for the implementation of array.length:
static VALUE
rb_ary_length(VALUE ary)
{
    long len = RARRAY_LEN(ary);
    return LONG2NUM(len);
}
And here is the relevant part from the implementation of array.count:
static VALUE
rb_ary_count(int argc, VALUE *argv, VALUE ary)
{
    long n = 0;
    if (argc == 0) {
        VALUE *p, *pend;
        if (!rb_block_given_p())
            return LONG2NUM(RARRAY_LEN(ary));
        // etc..
    }
}
The code for array.count does a few extra checks but in the end calls the exact same code: LONG2NUM(RARRAY_LEN(ary)).
Hashes (source code) on the other hand don't seem to implement their own optimized version of count so the implementation from Enumerable (source code) is used, which iterates over all the elements and counts them one-by-one.
In general I'd advise using length (or its alias size) rather than count if you want to know how many elements there are altogether.
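To see the generic Enumerable#count walking every element, here is a sketch with a made-up class (Countdown is hypothetical) that only defines each and records how many values it has yielded:

```ruby
# Hypothetical collection that relies entirely on Enumerable for count.
class Countdown
  include Enumerable
  attr_reader :yields

  def initialize(n)
    @n = n
    @yields = 0
  end

  def each
    @n.downto(1) do |i|
      @yields += 1   # track how many elements were visited
      yield i
    end
  end
end

countdown = Countdown.new(5)
total   = countdown.count    # => 5, obtained by iterating
visited = countdown.yields   # => 5, every element was visited to count them
```

A class like this could define its own size method returning @n directly, which is effectively what Array does.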
Regarding ActiveRecord, on the other hand, there are important differences. Check out this post:
Counting ActiveRecord associations: count, size or length?
There is a crucial difference for applications which make use of database connections.
When you are using many ORMs (ActiveRecord, DataMapper, etc.) the general understanding is that .size will generate a query that requests all of the items from the database ('select * from mytable') and then give you the number of items resulting, whereas .count will generate a single query ('select count(*) from mytable') which is considerably faster.
Because these ORMs are so prevalent, I follow the principle of least astonishment. In general, if I have something in memory already, then I use .size, and if my code will generate a request to a database (or external service via an API) I use .count.
In most cases (e.g. Array or String) size is an alias for length.
count normally comes from Enumerable and can take an optional predicate block. Thus enumerable.count {cond} is [roughly] (enumerable.select {cond}).length -- it can of course bypass the intermediate structure as it just needs the count of matching predicates.
Note: I am not sure if count forces an evaluation of the enumeration if the block is not specified or if it short-circuits to the length if possible.
Edit (and thanks to Mark's answer!): count without a block (at least for Arrays) does not force an evaluation. I suppose without formal behavior it's "open" for other implementations, if forcing an evaluation without a predicate ever even really makes sense anyway.
I found a good answer at http://blog.hasmanythrough.com/2008/2/27/count-length-size
In ActiveRecord, there are several ways to find out how many records
are in an association, and there are some subtle differences in how
they work.
post.comments.count - Determine the number of elements with an SQL
COUNT query. You can also specify conditions to count only a subset of
the associated elements (e.g. :conditions => {:author_name =>
"josh"}). If you set up a counter cache on the association, #count
will return that cached value instead of executing a new query.
post.comments.length - This always loads the contents of the
association into memory, then returns the number of elements loaded.
Note that this won't force an update if the association had been
previously loaded and then new comments were created through another
way (e.g. Comment.create(...) instead of post.comments.create(...)).
post.comments.size - This works as a combination of the two previous
options. If the collection has already been loaded, it will return its
length just like calling #length. If it hasn't been loaded yet, it's
like calling #count.
Also I have a personal experience:
<%= h(params.size.to_s) %> # works_like_that !
<%= h(params.count.to_s) %> # does_not_work_like_that !
There are several ways to find out how many elements an array has: .length, .count and .size. However, it's better to use array.size rather than array.count, because .size performs better.
Adding more to Mark Byers' answer: in Ruby, array.size is an alias for Array#length. There is no technical difference between the two, and you are unlikely to see any difference in performance either. Array#count does the same job, but with some extra functionality.
It can be used to get the number of elements matching some condition. count can be called in three ways:
Array#count # Returns number of elements in Array
Array#count n # Returns number of elements having value n in Array
Array#count{|i| i.even?} # Returns count based on condition invoked on each element of the array
array = [1,2,3,4,5,6,7,4,3,2,4,5,6,7,1,2,4]
array.size # => 17
array.length # => 17
array.count # => 17
Here all three methods do the same job. However here is where the count gets interesting.
Let us say I want to find how many elements with the value 2 the array contains:
array.count 2 # => 3
The array has a total of three elements with value as 2.
Now, I want to count the array elements greater than 4:
array.count{|i| i > 4} # => 6
The array has a total of 6 elements which are greater than 4.
I hope it gives some info about count method.

Code folding on consecutive collect/select/reject/each

I play around with arrays and hashes quite a lot in ruby and end up with some code that looks like this:
sum = two_dimensional_array.select{|i|
  i.collect{|j|
    j.to_i
  }.sum > 5
}.collect{|i|
  i.collect{|j|
    j ** 2
  }.average
}.sum
(Let's all pretend that the above code sample makes sense now...)
The problem is that even though TextMate (my editor of choice) picks up simple {...} or do...end blocks quite easily, it can't figure out (which is understandable since even I can't find a "correct" way to fold the above) where the above blocks start and end to fold them.
How would you fold the above code sample?
PS: considering that it could have 2 levels of folding, I only care about the outer consecutive ones (the blocks with the i)
To be honest, something that convoluted is probably confusing TextMate as much as anyone else who has to maintain it, and that includes you in the future.
Whenever you see something that rolls up into a single value, it's a good case for using Enumerable#inject.
sum = two_dimensional_array.inject(0) do |sum, row|
  # Convert row to Fixnum equivalent
  row_i = row.collect { |i| i.to_i }
  if (row_i.sum > 5)
    sum += row_i.collect { |i| i ** 2 }.average
  end
  sum # Carry through to next inject call
end
What's odd in your example is that you're using select to return the full array, allegedly converted using to_i, but in fact Enumerable#select does no such thing; it instead rejects any element for which the block returns nil or false. I'm presuming that's none of your values.
Also depending on how your .average method is implemented, you may want to seed the inject call with 0.0 instead of 0 to use a floating-point value.
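For a version that runs without a custom .average, here is a sketch that seeds inject with 0.0 and computes the mean inline. It assumes Ruby 2.4 or later for Array#sum, and the sample data is made up:

```ruby
two_dimensional_array = [%w[1 2 3], %w[0 1 2], %w[4 5 6]]

total = two_dimensional_array.inject(0.0) do |acc, row|
  ints = row.map(&:to_i)

  # Skip rows whose integer sum is not above the threshold,
  # passing the accumulator through unchanged.
  next acc unless ints.sum > 5

  # Mean of the squares; fdiv forces floating-point division.
  acc + ints.sum { |i| i**2 }.fdiv(ints.size)
end
```

With this data the first and third rows qualify, contributing 14/3 and 77/3 respectively.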
