Say you want to generate a random number between 1 and 1 billion:
rand(1..1_000_000_000)
Will Ruby create an array from that range every time you call this line of code?
RuboCop suggests this approach over rand(1_000_000_000) + 1, but it seems like there's potential for pain.
Ruby's docs say this:
# When +max+ is a Range, +rand+ returns a random number where
# range.member?(number) == true.
Here +max+ is the argument passed to rand, but the docs don't say how the number is actually produced. I'm also not sure whether calling .member? on a range is performant.
Any ideas?
I could benchmark it, but I'm still curious about the inner workings here.
No, Ruby will not create an array from that range unless you explicitly call the .to_a method on the Range object. In fact, rand doesn't work on arrays at all; .sample is the method to use for returning a random element from an array.
The Range class includes Enumerable, so you get Enumerable's iteration methods without having to convert the range into an array. The widest possible numeric Range is (-Float::INFINITY..Float::INFINITY), although passing that into rand raises a Numerical argument out of domain error (Errno::EDOM).
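For example:

rand(1..10)                              # => e.g. 7
rand(-Float::INFINITY..Float::INFINITY)  # raises Errno::EDOM (Numerical argument out of domain)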
As for .member?, that method simply calls a C function called range_cover, which calls another one called r_cover_p, which checks whether a value lies between two numbers or strings.
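A quick way to convince yourself that this is a bounds check rather than an iteration:

r = 1..1_000_000_000
r.member?(999_999_999)  # => true, returns instantly
r.cover?(999_999_999)   # => true, the same comparison-based check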
To test the difference in speed between passing a range to rand and calling sample on an array, you can perform the following test:
require 'benchmark'
puts Benchmark.measure { rand(0..10_000_000) }
=> 0.000000 0.000000 0.000000 ( 0.000009)
puts Benchmark.measure { (0..10_000_000).to_a.sample }
=> 0.300000 0.030000 0.330000 ( 0.347752)
As the first example shows, passing a range as a parameter to rand is extremely fast.
By contrast, calling .to_a.sample on a range is rather slow. This is due to the array creation, which has to allocate memory for every element. The .sample call itself is relatively fast, as it simply picks a random index into the array and returns the element at that index.
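For instance:

a = (0..10).to_a  # the expensive part: materialising the array
a.sample          # => e.g. 7, just a random index lookup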
To check out the code for Range, have a look here.
I have an array containing multiple arrays (the number may vary) of significant dimensions (an array may contain up to 150 objects).
With the array, I need to find a combination (one element for each sub-array) that matches a condition.
Due to the dimensions, I tried to use Enumerator::Lazy as follows
catch :match do
  array[0].product(*array[1..-1]).lazy.each do |combination|
    throw :match if ConditionMatcher.match(combination)
  end
end
However, I realized that when I call each, the enumerator is evaluated, and it performs very slowly.
I have tried to replace each with methods included in Enumerator::Lazy, such as take_while:
array[0].product(*array[1..-1]).lazy.take_while do |combination|
  return false if ConditionMatcher.match(combination)
end
But in this case too, the product is evaluated eagerly, with poor performance.
For better performance, even if I don't really like it, I'm thinking of replacing product with nested each loops. Something like:
catch :match do
  array[0].each do |first|
    array[1].each do |second|
      array[2].each do |third|
        throw :match if ConditionMatcher.match([first, second, third])
      end
    end
  end
end
Because the number of sub-arrays changes from time to time, I'm not sure how to implement this.
Moreover, is there a better way to loop through all the sub-arrays without loading the entire set of combinations?
Update 1 -
Each sub-array contains an ActiveRecord::Relation on a polymorphic association. Therefore, each element of each combination responds to the same 2 methods (start_time and end_time) each returning an instance of Time.
The matcher checks that no two objects in the combination have overlapping times.
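The question doesn't show ConditionMatcher itself; a minimal sketch of what such a matcher might look like, assuming a pairwise overlap check:

module ConditionMatcher
  # True if no two elements of the combination overlap in time.
  def self.match(combination)
    combination.combination(2).none? do |a, b|
      a.start_time < b.end_time && b.start_time < a.end_time
    end
  end
end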
The problem is that Array#product already returns a huge array containing all combinations. With 3 sub-arrays containing 150 items each, it returns a 150 × 150 × 150 = 3,375,000 element array. Calling lazy on that array won't speed up anything.
To make product calculate the Cartesian product lazily (i.e. one combination after the other), you simply have to (directly) pass a block to it:
first, *others = array
first.product(*others) do |combination|
  # ...
end
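Putting this together with the catch/throw early exit from the question (ConditionMatcher as defined there):

first, *others = array
match = catch(:match) do
  first.product(*others) do |combination|
    throw :match, combination if ConditionMatcher.match(combination)
  end
  nil  # value of the catch block when nothing matched
end

With a block, product yields each combination as it is generated instead of materialising the full Cartesian product, so memory use stays flat and iteration stops as soon as throw fires.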
(1..999).to_a
Is this method O(n)? I'm wondering if the conversion involves an implicit iteration so Ruby can write the values one-by-one into consecutive memory addresses.
The method is actually slightly worse than O(n). Not only does it do a naive iteration, but it doesn't check ahead of time what the size will be, so it has to repeatedly allocate more memory as it iterates. I've opened an issue for that aspect and it's been discussed a few times on the mailing list (and briefly added to ruby-core). The problem is that, like almost anything in Ruby, Range can be opened up and messed with, so Ruby can't really optimize the method. It can't even count on Range#size returning the correct result. Worse, some enumerables even have their size method delegate to to_a.
In general, it shouldn't be necessary to make this conversion, but if you really need array methods, you might be able to use Array#fill instead, which lets you populate a (potentially pre-allocated) array using values derived from its indices.
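A minimal illustration of the fill variant (equivalent to (0...n).to_a):

n = 5
Array.new(n).fill { |i| i }  # => [0, 1, 2, 3, 4]; the block receives each index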
Range.instance_methods(false).include? :to_a
# false
Range doesn't define to_a itself; it inherits it from the Enumerable mix-in, which builds the array by pushing each value on one at a time. That seems like it would be really inefficient, but I'll let the benchmark speak for itself:
require 'benchmark'

size = 100_000_000
Benchmark.bmbm do |r|
  r.report("range") { (0...size).to_a }
  r.report("fill")  { a = Array.new(size); a.fill { |i| i } }
  r.report("array") { Array.new(size) { |i| i } }
end
# Rehearsal -----------------------------------------
# range 4.530000 0.180000 4.710000 ( 4.716628)
# fill 5.810000 0.150000 5.960000 ( 5.966710)
# array 7.630000 0.250000 7.880000 ( 7.879940)
# ------------------------------- total: 18.550000sec
#
# user system total real
# range 4.540000 0.120000 4.660000 ( 4.660249)
# fill 5.980000 0.110000 6.090000 ( 6.089962)
# array 7.880000 0.110000 7.990000 ( 7.985818)
Isn't that weird? Naive iteration is actually the fastest by a significant margin. And manually filling an Array is somehow faster than just using the constructor.
But as is usually the case with Ruby, don't worry too much about this. For reasonably sized ranges the performance difference will be negligible.
I have evaluated access times for a two-dimensional array, implemented as
an array of arrays
a hash of arrays
a hash with arrays as keys
My expectation was to see similar access times for all three. I also expected the measurements to yield similar results for MRI and JRuby.
However, for a reason I do not understand, on MRI accessing elements of an array of arrays or of a hash of arrays is an order of magnitude faster than accessing elements of the flat hash keyed by [x, y] pairs.
On JRuby the gap is even wider: going by the real times below, flat-hash access is a few hundred times as expensive as array-of-arrays access.
The results:
MRI (2.1.1):
user system total real
hash of arrays: 1.300000 0.000000 1.300000 ( 1.302235)
array of arrays: 0.890000 0.000000 0.890000 ( 0.896942)
flat hash: 16.830000 0.000000 16.830000 ( 16.861716)
JRuby (1.7.5):
user system total real
hash of arrays: 0.280000 0.000000 0.280000 ( 0.265000)
array of arrays: 0.250000 0.000000 0.250000 ( 0.182000)
flat hash: 77.450000 0.240000 77.690000 ( 75.156000)
Here are two of my benchmarks:
ary = (0...n).map { Array.new(n, 1) }

bm.report('array of arrays:') do
  iterations.times do
    (0...n).each { |x|
      (0...n).each { |y|
        v = ary[x][y]
      }
    }
  end
end
hash = {}
(0...n).each { |x|
  (0...n).each { |y|
    hash[[x, y]] = 1
  }
}

prepared_indices = (0...n).each_with_object([]) { |x, ind|
  (0...n).each { |y|
    ind << [x, y]
  }
}

bm.report('flat hash:') do
  iterations.times do
    prepared_indices.each { |i|
      v = hash[i]
    }
  end
end
All container elements are initialized with a numeric value and have the same total number of elements.
The arrays for accessing the hash are preinitialized in order to benchmark the element access only.
Here is the complete code
I have consulted this thread and this article but still have no clue about the unexpected performance differences.
Why are the results so different from my expectations? What am I missing?
Consider the memory layout of an array of arrays, say with dimensions 3x3... you've got something like this:
memory address usage/content
base [0][0]
base+sizeof(int) [0][1]
base+2*sizeof(int) [0][2]
base+3*sizeof(int) [1][0]
base+4*sizeof(int) [1][1]
...
Given an array of dimensions [M][N], all that's needed to access an element at indices [i][j] is to add to the base memory address the data element size times (i * N + j)... a tiny bit of simple arithmetic, and therefore extremely fast.
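A toy Ruby model of that layout (illustrative only, not what the interpreter literally does):

rows, cols = 3, 3
flat = Array.new(rows * cols) { |k| k }  # one contiguous buffer
i, j = 1, 2
flat[i * cols + j]  # => 5; one multiply and one add locate the element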
Hashes are far more complicated data structures and are inherently slower. With a hash, you first need to hash the key. The harder the hash function works to scatter similar keys randomly across its output range, the slower it tends to be; if it doesn't make that effort, you get more collisions in the hash table and slower lookups there instead. The hash value then has to be mapped onto the current hash table size (usually with %), and finally keys are compared to decide whether you've found the hoped-for key, a colliding element, or an empty slot. It's a far more involved process than array indexing. You should probably do some background reading about hash functions and hash table implementations.
The reason hashes are so often useful is that the key doesn't need to be numeric (you can always work out some formula to generate a number from arbitrary key data) and doesn't need to be near-contiguous for memory efficiency. A hash table with memory capacity for, say, 5 integers can happily store the keys 1, 1000 and 12398239, whereas an array keyed on those values would waste a lot of virtual address space on the empty indices in between, which hold no data anyway (and more data packed into a memory page means more cache hits).
Further, be careful with benchmarks: when you do clearly repetitive work with unchanging values, overwriting the same variable, an optimiser may elide it, and you may not be timing what you think you are. It's good to use some run-time inputs (e.g. storing different values in the containers), to accumulate some dependent result (e.g. summing the element accesses rather than overwriting a variable), and then to output that result so any lazy evaluation is forced to conclude. With JITs and VMs involved there can also be kinks in your benchmarks as compilation kicks in or branch-prediction results are incorporated.
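A minimal sketch of that benchmark hygiene, applied to the array-of-arrays case from the question (with n scaled down):

require 'benchmark'

n = 1_000
ary = (0...n).map { |x| Array.new(n) { |y| x + y } }  # varied contents
sum = 0
puts Benchmark.measure {
  (0...n).each { |x| (0...n).each { |y| sum += ary[x][y] } }
}
puts sum  # printing the accumulated result keeps the work from being elided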
I'm attempting to solve http://projecteuler.net/problem=1.
I want to create a method which takes in an integer and then creates an array of all the integers preceding it and the integer itself as values within the array.
Below is what I have so far. Code doesn't work.
def make_array(num)
  numbers = Array.new num
  count = 1
  numbers.each do |number|
    numbers << number = count
    count = count + 1
  end
  return numbers
end
make_array(10)
(1..num).to_a is all you need to do in Ruby.
1..num will create a Range object starting at 1 and ending at whatever value num is. Range objects have a to_a method that blows them up into real Arrays by enumerating each element within the range.
For most purposes, you won't actually need the Array - Range will work fine. That includes iteration (which is what I assume you want, given the problem you're working on).
That said, knowing how to create such an Array "by hand" is a valuable learning experience, so you might want to keep working on it a bit. Hint: start with an empty array ([]) instead of with Array.new num, then iterate num.times and push numbers into the Array. If you start with an Array of size num and then push num elements into it, you end up with twice num elements. And because, as in your case, you're adding elements while iterating the array, the loop never exits: for each element you process, you add another one. It's like chasing a metal ball with the repulsing side of a magnet.
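One possible fix following those hints (a sketch, not the only way):

def make_array(num)
  numbers = []                        # start empty instead of Array.new(num)
  num.times { |i| numbers << i + 1 }  # push 1..num, one value at a time
  numbers
end

make_array(10)  # => [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]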
To answer the Euler Question:
(1 ... 1000).to_a.select{|x| x%3==0 || x%5==0}.reduce(:+) # => 233168
Sometimes a one-liner is more readable than more detailed code, I think.
Assuming you are learning Ruby through examples on Project Euler, I'll explain what the line does:
(1 ... 1000).to_a
will create an array with the numbers one to 999. The Euler question asks for numbers below 1000, and using three dots in a Range excludes the boundary value itself.
.select{|x| x%3==0 || x%5==0}
chooses only the elements that are divisible by 3 or 5, i.e. the multiples of 3 or 5; the other values are discarded. The result of this operation is a new Array containing only those multiples.
.reduce(:+)
Finally, this operation sums up all the numbers in the array (reducing it to a single number): the sum you need for the solution.
What I want to illustrate: many methods you would otherwise write by hand are already built into Ruby, since it is a language by programmers, for programmers. Be pragmatic ;)
OK, so say you have a really big Range in Ruby. I want to find a way to get the max value in the Range.
The Range is exclusive (defined with three dots), meaning that it does not include the end object in its results. It could be made up of Integer, String, Time, or really any object that responds to #<=> and #succ (the only requirements for the start/end objects of a Range).
Here's an example of an exclusive range:
past = Time.local(2010, 1, 1, 0, 0, 0)
now = Time.now
range = past...now
range.include?(now) # => false
Now I know I could just do something like this to get the max value:
range.max # => returns 1 second before "now" using Enumerable#max
But this will take a non-trivial amount of time to execute. I also know that I could subtract 1 second from whatever the end object is. However, the object may be something other than Time, and it may not even support #-. I would prefer an efficient general solution, but I am willing to combine special-case code with a fallback to a general solution (more on that later).
As mentioned above, Range#last won't work either, because the range is exclusive and does not include the last value in its results.
The fastest approach I could think of was this:
max = nil
range.each { |value| max = value }
# max now contains nil if the range is empty, or the max value
This is similar to what Enumerable#max does (which Range inherits), except that it exploits the fact that each value is greater than the previous one, so we can skip using #<=> to compare each value with the previous (the way Range#max does), saving a tiny bit of time.
The other approach I was thinking about was to have special case code for common ruby types like Integer, String, Time, Date, DateTime, and then use the above code as a fallback. It'd be a bit ugly, but probably much more efficient when those object types are encountered because I could use subtraction from Range#last to get the max value without any iterating.
Can anyone think of a more efficient/faster approach than this?
The simplest solution that I can think of, which will work for inclusive as well as exclusive ranges:
range.max
Some other possible solutions:
range.entries.last
range.entries[-1]
These solutions are all O(n), and will be very slow for large ranges. The problem in principle is that range values in Ruby are enumerated using the succ method iteratively on all values, starting at the beginning. The elements do not have to implement a method to return the previous value (i.e. pred).
The fastest method would be to find the predecessor of the last item (an O(1) solution):
range.exclude_end? ? range.last.pred : range.last
This works only for ranges whose elements implement pred. Later versions of Ruby implement pred for integers; you have to add the method yourself if it does not exist (essentially the special-case code you suggested, but slightly simpler to implement).
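For example, a small helper combining the O(1) path with the general fallback (the name range_max is mine, not part of Ruby):

def range_max(range)
  if range.exclude_end? && range.last.respond_to?(:pred)
    range.last.pred  # O(1)
  else
    range.max        # falls back to O(n) enumeration where needed
  end
end

range_max(1...1_000_000)  # => 999999, effectively instant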
Some quick benchmarking shows that the pred-based method is the fastest by many orders of magnitude for large ranges (here range = 1...1000000), because it is O(1):
user system total real
r.entries.last 11.760000 0.880000 12.640000 ( 12.963178)
r.entries[-1] 11.650000 0.800000 12.450000 ( 12.627440)
last = nil; r.each { |v| last = v } 20.750000 0.020000 20.770000 ( 20.910416)
r.max 17.590000 0.010000 17.600000 ( 17.633006)
r.exclude_end? ? r.last.pred : r.last 0.000000 0.000000 0.000000 ( 0.000062)
Benchmark code is here.
In the comments it is suggested to use range.last - (range.exclude_end? ? 1 : 0). That does work for dates without any additional methods, but it will never work for non-numeric ranges: String#- does not exist, and subtracting an integer makes no sense for strings anyway. String#pred, however, can be implemented.
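For instance, a deliberately limited sketch that handles single ASCII characters only (a full inverse of String#succ is considerably hairier):

class String
  def pred
    raise ArgumentError, 'single-character strings only' unless size == 1
    (ord - 1).chr
  end
end

r = 'a'...'e'
r.exclude_end? ? r.last.pred : r.last  # => "d"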
I'm not sure about the speed (and initial tests don't seem incredibly fast), but the following might do what you need:
past = Time.local(2010, 1, 1, 0, 0, 0)
now = Time.now
range = past...now
range.to_a[-1]
Very basic testing (counting in my head) showed that it took about 4 seconds, while the method you provided took about 5-6. Hope this helps.
Edit 1: Removed second solution as it was totally wrong.
I can't think of any way to achieve this that doesn't involve enumerating the range, unless, as already mentioned, you have other information about how the range was constructed and can therefore infer the desired value without enumeration. Of all the suggestions, I'd go with #max, since it seems the most expressive.
require 'benchmark'

N = 20
Benchmark.bm(30) do |r|
  past, now = Time.local(2010, 2, 1, 0, 0, 0), Time.now
  range = past...now
  r.report("range.max") do
    N.times { last_in_range = range.max }
  end
  r.report("explicit enumeration") do
    N.times { range.each { |value| last_in_range = value } }
  end
  r.report("range.entries.last") do
    N.times { last_in_range = range.entries.last }
  end
  r.report("range.to_a[-1]") do
    N.times { last_in_range = range.to_a[-1] }
  end
end
user system total real
range.max 49.406000 1.515000 50.921000 ( 50.985000)
explicit enumeration 52.250000 1.719000 53.969000 ( 54.156000)
range.entries.last 53.422000 4.844000 58.266000 ( 58.390000)
range.to_a[-1] 49.187000 5.234000 54.421000 ( 54.500000)
I notice that the 3rd and 4th options show significantly increased system time. I expect that's related to the explicit creation of an array, which seems like a good reason to avoid them, even if they're not obviously more expensive in elapsed time.