How fast is Ruby's method for converting a range to an array?

(1..999).to_a
Is this method O(n)? I'm wondering if the conversion involves an implicit iteration so Ruby can write the values one-by-one into consecutive memory addresses.

The method is actually slightly worse than O(n). Not only does it do a naive iteration, but it doesn't check ahead of time what the size will be, so it has to repeatedly allocate more memory as it iterates. I've opened an issue for that aspect, and it's been discussed a few times on the mailing list (and brought up briefly on ruby-core). The problem is that, like almost anything in Ruby, Range can be opened up and messed with, so Ruby can't really optimize the method. It can't even count on Range#size returning the correct result. Worse, some enumerables even have their size method delegate to to_a.
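To illustrate the "opened up and messed with" point, here's a hedged sketch (SneakyRange is a made-up name): a subclass can redefine each, and then iteration-based Enumerable methods disagree with the arithmetically computed Range#size, so Ruby can't safely pre-size the result:

```ruby
# A subclass that lies about its contents during iteration.
class SneakyRange < Range
  def each
    yield :surprise
  end
end

r = SneakyRange.new(1, 5)

# Enumerable methods go through each, so they see the lie...
p r.map { |x| x }  # => [:surprise]

# ...while size is computed arithmetically from the endpoints.
p r.size           # => 5
```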
In general, it shouldn't be necessary to make this conversion, but if you really need array methods, you might be able to use Array#fill instead, which lets you populate a (potentially pre-allocated) array using values derived from its indices.

Range.instance_methods(false).include? :to_a
# false
Range doesn't have to_a, it inherits it from the Enumerable mix-in, so it builds the array by pushing each value on one at a time. That seems like it would be really inefficient, but I'll let the benchmark speak for itself:
require 'benchmark'

size = 100_000_000

Benchmark.bmbm do |r|
  r.report("range") { (0...size).to_a }
  r.report("fill")  { a = Array.new(size); a.fill { |i| i } }
  r.report("array") { Array.new(size) { |i| i } }
end
# Rehearsal -----------------------------------------
# range   4.530000   0.180000   4.710000 (  4.716628)
# fill    5.810000   0.150000   5.960000 (  5.966710)
# array   7.630000   0.250000   7.880000 (  7.879940)
# ------------------------------- total: 18.550000sec
#
#             user     system      total        real
# range   4.540000   0.120000   4.660000 (  4.660249)
# fill    5.980000   0.110000   6.090000 (  6.089962)
# array   7.880000   0.110000   7.990000 (  7.985818)
Isn't that weird? Building the array straight from the range is actually the fastest by a significant margin. And manually filling an Array is somehow faster than just using the constructor.
But as is usually the case with Ruby, don't worry too much about this. For reasonably sized ranges the performance difference will be negligible.
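For reference, the push-one-at-a-time behaviour described above boils down to something like this (a hypothetical Ruby rendition of the C implementation; to_a_sketch is a made-up name):

```ruby
module Enumerable
  # Hypothetical rendition of Enumerable#to_a: no size hint up front,
  # just append every value that each yields, growing as needed.
  def to_a_sketch
    out = []
    each { |x| out << x }
    out
  end
end

p (1..5).to_a_sketch  # => [1, 2, 3, 4, 5]
```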

Related

How expensive is generating a random number in Ruby?

Say you want to generate a random number between 1 and 1 billion:
rand(1..1_000_000_000)
Will Ruby create an array from that range every time you call this line of code?
Rubocop suggests this approach over rand(1_000_000_000)+1 but it seems there's potential for pain.
Ruby's docs say this:
# When +max+ is a Range, +rand+ returns a random number where
# range.member?(number) == true.
Where +max+ is the argument passed to rand, but the docs don't say how the number is generated. I'm also not sure if calling .member? on a range is performant.
Any ideas?
I can use benchmark but still curious about the inner workings here.
No, Ruby will not create an array from that range, unless you explicitly call the .to_a method on the Range object. In fact, rand() doesn't work on arrays - .sample is the method to use for returning a random element from an array.
The Range class includes Enumerable, so you get Enumerable's iteration methods without having to convert the range into an array. A Range's endpoints can be as extreme as (-Float::INFINITY..Float::INFINITY), although passing that into rand raises a Numerical argument out of domain error.
As for .member?, that method simply calls a C function called range_cover that calls another one called r_cover_p which checks if a value is between two numbers or strings.
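In Ruby terms, that check is roughly the following (a hedged sketch of the C logic; cover_sketch is a made-up name): two spaceship comparisons, with the upper bound strict for exclusive ranges.

```ruby
# Roughly what range_cover / r_cover_p do: compare against both
# endpoints with <=>, no iteration and no array involved.
def cover_sketch(range, val)
  return false if (range.begin <=> val) > 0  # val is below the range
  c = val <=> range.end
  range.exclude_end? ? c < 0 : c <= 0
end

p cover_sketch(0...10, 10)  # => false (exclusive end)
p cover_sketch(0..10, 10)   # => true
```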
To test the difference in speed between passing a range to rand and calling sample on an array, you can perform the following test:
require 'benchmark'
puts Benchmark.measure { rand(0..10_000_000) }
=> 0.000000 0.000000 0.000000 ( 0.000009)
puts Benchmark.measure { (0..10_000_000).to_a.sample }
=> 0.300000 0.030000 0.330000 ( 0.347752)
As you can see in the first example, passing a range as a parameter to rand is extremely fast.
By contrast, calling .to_a.sample on a range is rather slow. This is due to the array creation, which requires allocating memory for, and populating, the entire array. The .sample method itself should be relatively fast, as it simply picks a random index into the array and returns the element there.
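You can also convince yourself that no array is involved by noting that, for integer endpoints, rand with a range behaves like plain offset arithmetic (the equivalence shown here is an assumption for integer ranges):

```ruby
lo, hi = 1, 1_000_000_000

x = rand(lo..hi)            # no intermediate array is built
y = lo + rand(hi - lo + 1)  # arithmetic equivalent for integers

p (lo..hi).cover?(x)  # => true
p (lo..hi).cover?(y)  # => true
```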
To check out the code for range have a look here.

More concise way of setting a counter, incrementing, and returning?

I'm working on a problem in which I want to compare two strings of equal length, char by char. For each index where the chars differ, I need to increment the counter by 1. Right now I have this:
def compute(strand1, strand2)
  raise ArgumentError, "Sequences are different lengths" unless strand1.length == strand2.length
  mutations = 0
  strand1.chars.each_with_index { |nucleotide, index| mutations += 1 if nucleotide != strand2[index] }
  mutations
end
I'm new to this language but this feels very non-Ruby to me. Is there a one-liner that can consolidate the process of setting a counter, incrementing it, and then returning it?
I was thinking along the lines of selecting all chars that don't match up and then getting the size of the resulting array. However, from what I could tell, there is no select_with_index method. I was also looking into inject but can't seem to figure out how I would apply it in this scenario.
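In fact, a select-with-index is available by chaining: select with no block returns an Enumerator, and every Enumerator supports with_index, so the "select then count" idea works directly (hypothetical sample strings below):

```ruby
strand1 = "abcd"
strand2 = "abed"

# Blockless select yields an Enumerator; with_index feeds the block
# (element, index) pairs and hands the result back to select.
mismatches = strand1.chars.select.with_index { |c, i| c != strand2[i] }

p mismatches       # => ["c"]
p mismatches.size  # => 1
```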
Probably the simplest answer is to just count the differences:
strand1.chars.zip(strand2.chars).count{|a, b| a != b}
Here is a number of one-liners:
strand1.size.times.select { |index| strand1[index] != strand2[index] }.size
This is not the best one since it generates an intermediary array up to O(n) in size.
strand1.size.times.inject(0) { |sum, index| sum += 1 if strand1[index] != strand2[index]; sum }
Does not take extra memory but is a bit hard to read.
strand1.chars.each_with_index.count { |x, index| x != strand2[index] }
I'd go with that. Kudos to user12341234 for mentioning count, which this is built upon.
Update
Benchmarking on my machine gives different results from what @CarySwoveland gets:
                   user     system      total        real
mikej          4.080000   0.320000   4.400000 (  4.408245)
user12341234   3.790000   0.210000   4.000000 (  4.003349)
Nik            2.830000   0.170000   3.000000 (  3.008533)

                   user     system      total        real
mikej          3.590000   0.020000   3.610000 (  3.618018)
user12341234   4.040000   0.140000   4.180000 (  4.183357)
lazy user      4.250000   0.240000   4.490000 (  4.497161)
Nik            2.790000   0.010000   2.800000 (  2.808378)
This is not to point out my code as having better performance, but to mention that environment plays a great role in choosing any particular implementation approach.
I'm running this on native 3.13.0-24-generic #47-Ubuntu SMP x64 on 12-Core i7-3930K with enough RAM.
This is just an extended comment, so no upvotes please (downvotes are OK).
Gentlemen: start your engines!
Test problem
def random_string(n, selection)
  n.times.each_with_object('') { |_, s| s << selection.sample }
end
n = 5_000_000
a = ('a'..'e').to_a
s1 = random_string(n,a)
s2 = random_string(n,a)
Benchmark code
require 'benchmark'

Benchmark.bm(12) do |x|
  x.report('mikej') do
    s1.chars.each_with_index.inject(0) { |c, (n, i)| n == s2[i] ? c : c + 1 }
  end
  x.report('user12341234') do
    s1.chars.zip(s2.chars).count { |a, b| a != b }
  end
  x.report('lazy user') do
    s1.chars.zip(s2.chars).lazy.count { |a, b| a != b }
  end
  x.report('Nik') do
    s1.chars.each_with_index.count { |x, i| x != s2[i] }
  end
end
Results
                   user     system      total        real
mikej          6.220000   0.430000   6.650000 (  6.651575)
user12341234   6.600000   0.900000   7.500000 (  7.504499)
lazy user      7.460000   7.800000  15.260000 ( 15.255265)
Nik            6.140000   3.080000   9.220000 (  9.225023)

                   user     system      total        real
mikej          6.190000   0.470000   6.660000 (  6.662569)
user12341234   6.720000   0.500000   7.220000 (  7.223716)
lazy user      7.250000   7.110000  14.360000 ( 14.356845)
Nik            5.690000   0.920000   6.610000 (  6.621889)
[Edit: I added a lazy version of 'user12341234'. As is generally the case with lazy versions of enumerators, there is a tradeoff between the amount of memory used and execution time. I did several runs. Here I report the results of two typical ones. There was very little variability for 'mikej' and 'user12341234', somewhat more for lazy user and quite a bit for Nik.]
inject would indeed be a way to do this with a one-liner, so you are on the right track:
strand1.chars.each_with_index.inject(0) { |count, (nucleotide, index)|
  nucleotide == strand2[index] ? count : count + 1 }
We start with an initial value of 0. Then, if the two letters are the same, we just return the current value of the accumulator; if the letters are different, we add 1.
Also, notice here that each_with_index is being called without a block. When called like this it returns an enumerator object. This lets us use one of the other enumerator methods (in this case inject) with the pairs of value and index returned by each_with_index, allowing the functionality of each_with_index and inject to be combined.
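A quick illustration of that blockless call:

```ruby
e = [10, 20].each_with_index
p e.class  # => Enumerator
p e.to_a   # => [[10, 0], [20, 1]]

# The enumerator's (value, index) pairs feed straight into inject:
p e.inject(0) { |sum, (value, index)| sum + value + index }  # => 31
```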
Edit: looking at it, user12341234's solution is a nice way of doing it. Good use of zip! I'd probably go with that.

Unexpected access performance differences between arrays and hashes

I have evaluated access times for a two-dimensional array, implemented as
an array of arrays
a hash of arrays
a hash with arrays as keys
My expectation was to see similar access times for all 3. I also expected the measurements to yield similar results for MRI and JRuby.
However, for a reason that I do not understand, on MRI accessing elements within an array of arrays or within a hash of arrays is an order of magnitude faster than accessing elements of a hash.
On JRuby, instead of being 10 times as expensive, hash access was about 50 times as expensive as with an array of arrays.
The results:
MRI (2.1.1):
user system total real
hash of arrays: 1.300000 0.000000 1.300000 ( 1.302235)
array of arrays: 0.890000 0.000000 0.890000 ( 0.896942)
flat hash: 16.830000 0.000000 16.830000 ( 16.861716)
JRuby (1.7.5):
user system total real
hash of arrays: 0.280000 0.000000 0.280000 ( 0.265000)
array of arrays: 0.250000 0.000000 0.250000 ( 0.182000)
flat hash: 77.450000 0.240000 77.690000 ( 75.156000)
Here are two of my benchmarks:
ary = (0...n).map { Array.new(n, 1) }

bm.report('array of arrays:') do
  iterations.times do
    (0...n).each { |x|
      (0...n).each { |y|
        v = ary[x][y]
      }
    }
  end
end
hash = {}
(0...n).each { |x|
  (0...n).each { |y|
    hash[[x, y]] = 1
  }
}

prepared_indices = (0...n).each_with_object([]) { |x, ind|
  (0...n).each { |y|
    ind << [x, y]
  }
}

bm.report('flat hash:') do
  iterations.times do
    prepared_indices.each { |i|
      v = hash[i]
    }
  end
end
All container elements are initialized with a numeric value and have the same total number of elements.
The arrays for accessing the hash are preinitialized in order to benchmark the element access only.
Here is the complete code
I have consulted this thread and this article but still have no clue about the unexpected performance differences.
Why are the results so different from my expectations? What am I missing?
Consider the memory layout of an array of arrays, say with dimensions 3x3... you've got something like this:
memory address       usage/content
base                 [0][0]
base+sizeof(int)     [0][1]
base+2*sizeof(int)   [0][2]
base+3*sizeof(int)   [1][0]
base+4*sizeof(int)   [1][1]
...
Given an array of dimensions [M][N], all that's needed to access an element at indices [i][j] is to add the base memory address to the data element size times (i * N + j)... a tiny bit of simple arithmetic, and therefore extremely fast.
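The same row-major arithmetic can be mimicked in Ruby with a single flat array (a sketch of the idea; real C arrays do this at the pointer level, with no per-access method dispatch):

```ruby
m, n = 3, 4                        # dimensions [M][N]
flat = Array.new(m * n) { |k| k }  # row-major storage: 0, 1, 2, ...

i, j = 1, 2
p flat[i * n + j]  # => 6, the element "at [1][2]"
```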
Hashes are far more complicated data structures and inherently slower. With a hash, you first need to take time to hash the key. The harder the hash function works to scatter even similar keys statistically randomly across its output range, the slower it tends to be; if it doesn't make that effort, you get more collisions in the hash table and slower performance there instead. The hash value then needs to be mapped onto the current hash table size (usually using "%"), and finally you need to compare keys to see if you've found the hoped-for key, a colliding element, or an empty element. It's a far more involved process than array indexing. You should probably do some background reading about hash functions and hash table implementations.
The reason hashes are so often useful is that the key doesn't need to be numeric (you can always work out some formula to generate a number from arbitrary key data) and need not be near-contiguous for memory efficiency (i.e. a hash table with say memory capacity for 5 integers could happily store keys 1, 1000 and 12398239 - whereas for an array keyed on those values there would be a lot of virtual address space wasted for all the indices in between, which have no data anyway, and anyway more data packed into a memory page means more cache hits).
Further - you should be careful with benchmarks - when you do clearly repetitive work with unchanging values overwriting the same variable, an optimiser may avoid it and you may not be timing what you think you are. It's good to use some run-time inputs (e.g. storing different values in the containers) and accumulate some dependent result (e.g. summing the element accesses rather than overwriting it), then outputting the result so any lazy evaluation is forced to conclude. With things like JITs and VMs involved there can also be kinks in your benchmarks as compilation kicks in or branch prediction results are incorporated.
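A small illustration of that last point: accumulate a result that depends on every access and then print it, so the work cannot be treated as dead code (a sketch, not a full benchmark; the sizes are arbitrary):

```ruby
require 'benchmark'

# Run-time input: different values each run, so nothing can be precomputed.
data = Array.new(100_000) { rand(100) }

sum = 0
elapsed = Benchmark.realtime do
  data.each { |v| sum += v }  # dependent accumulation, not overwriting one slot
end

puts "sum=#{sum} elapsed=#{elapsed}"  # using the result forces the work
```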

Ruby - return an array in random order

What is the easiest way to return an array in random order in Ruby?
Anything that is nice and short that can be used in an IRB session like
[1,2,3,4,5].random()
# or
random_sort([1,2,3,4,5])
array.shuffle
If you don't have [].shuffle, [].sort_by{rand} works as pointed out by sepp2k. .sort_by temporarily replaces each element by something for the purpose of sorting, in this case, a random number.
[].sort{rand-0.5} however, won't properly shuffle. Some languages (e.g. some Javascript implementations) don't properly shuffle arrays if you do a random sort on the array, with sometimes rather public consequences.
JS Analysis (with graphs!): http://www.robweir.com/blog/2010/02/microsoft-random-browser-ballot.html
Ruby is no different! It has the same problem. :)
# sort a bunch of small arrays by rand-0.5
a = []
100000.times { a << [0, 1, 2, 3, 4].sort { rand - 0.5 } }

# count how many times each number occurs in each position
b = []
a.each do |x|
  x.each_index do |i|
    b[i] ||= []
    b[i][x[i]] ||= 0
    b[i][x[i]] += 1
  end
end
p b
=>
[[22336, 18872, 14814, 21645, 22333],
[17827, 25005, 20418, 18932, 17818],
[19665, 15726, 29575, 15522, 19512],
[18075, 18785, 20283, 24931, 17926],
[22097, 21612, 14910, 18970, 22411]]
Each element should occur in each position about 20000 times. [].sort_by{rand} gives much better results.
# sort with elements first mapped to random numbers
a = []
100000.times { a << [0, 1, 2, 3, 4].sort_by { rand } }

# count how many times each number occurs in each position
...
=>
[[19913, 20074, 20148, 19974, 19891],
[19975, 19918, 20024, 20030, 20053],
[20028, 20061, 19914, 20088, 19909],
[20099, 19882, 19871, 19965, 20183],
[19985, 20065, 20043, 19943, 19964]]
Similarly for [].shuffle (which is probably fastest)
[[20011, 19881, 20222, 19961, 19925],
[19966, 20199, 20015, 19880, 19940],
[20062, 19894, 20065, 19965, 20014],
[19970, 20064, 19851, 20043, 20072],
[19991, 19962, 19847, 20151, 20049]]
What about this?
Helper methods for Enumerable, Array, Hash, and String
that let you pick a random item or shuffle the order of items.
http://raa.ruby-lang.org/project/rand/

Fastest way to get maximum value from an exclusive Range in ruby

Ok, so say you have a really big Range in ruby. I want to find a way to get the max value in the Range.
The Range is exclusive (defined with three dots), meaning that it does not include the end object in its results. It could be made up of Integer, String, Time, or really any object that responds to #<=> and #succ (which are the only requirements for the start/end objects of a Range).
Here's an example of an exclusive range:
past = Time.local(2010, 1, 1, 0, 0, 0)
now = Time.now
range = past...now
range.include?(now) # => false
Now I know I could just do something like this to get the max value:
range.max # => returns 1 second before "now" using Enumerable#max
But this will take a non-trivial amount of time to execute. I also know that I could subtract 1 second from whatever the end object is. However, the object may be something other than Time, and it may not even support #-. I would prefer to find an efficient general solution, but I am willing to combine special case code with a fallback to a general solution (more on that later).
As mentioned above, using Range#last won't work either, because it's an exclusive range and does not include the last value in its results.
The fastest approach I could think of was this:
max = nil
range.each { |value| max = value }
# max now contains nil if the range is empty, or the max value
This is similar to what Enumerable#max does (which Range inherits), except that it exploits the fact that each value is going to be greater than the previous, so we can skip using #<=> to compare each value with the previous (the way Range#max does), saving a tiny bit of time.
The other approach I was thinking about was to have special case code for common ruby types like Integer, String, Time, Date, DateTime, and then use the above code as a fallback. It'd be a bit ugly, but probably much more efficient when those object types are encountered because I could use subtraction from Range#last to get the max value without any iterating.
Can anyone think of a more efficient/faster approach than this?
The simplest solution that I can think of, which will work for inclusive as well as exclusive ranges:
range.max
Some other possible solutions:
range.entries.last
range.entries[-1]
These solutions are all O(n), and will be very slow for large ranges. The problem in principle is that range values in Ruby are enumerated using the succ method iteratively on all values, starting at the beginning. The elements do not have to implement a method to return the previous value (i.e. pred).
The fastest method would be to find the predecessor of the last item (an O(1) solution):
range.exclude_end? ? range.last.pred : range.last
This works only for ranges that have elements which implement pred. Later versions of Ruby implement pred for integers. You have to add the method yourself if it does not exist (essentially equivalent to special case code you suggested, but slightly simpler to implement).
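For example, a Time endpoint doesn't respond to pred, but you can supply one yourself (a hedged sketch: Time#pred is not a real Ruby method, and one second is an assumed granularity):

```ruby
past  = Time.local(2010, 1, 1, 0, 0, 0)
now   = Time.now
range = past...now

# Singleton pred on the endpoint object: one second earlier.
def now.pred
  self - 1
end

# O(1): no iteration through the range at all.
max = range.exclude_end? ? range.last.pred : range.last
p max == now - 1  # => true
```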
Some quick benchmarking shows that this last method is the fastest by many orders of magnitude for large ranges (in this case range = 1...1000000), because it is O(1):
user system total real
r.entries.last 11.760000 0.880000 12.640000 ( 12.963178)
r.entries[-1] 11.650000 0.800000 12.450000 ( 12.627440)
last = nil; r.each { |v| last = v } 20.750000 0.020000 20.770000 ( 20.910416)
r.max 17.590000 0.010000 17.600000 ( 17.633006)
r.exclude_end? ? r.last.pred : r.last 0.000000 0.000000 0.000000 ( 0.000062)
Benchmark code is here.
In the comments it is suggested to use range.last - (range.exclude_end? ? 1 : 0). It does work for dates without additional methods, but will never work for non-numeric ranges: String#- does not exist, and subtracting an integer makes no sense for strings. String#pred, however, can be implemented.
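A minimal sketch of such a pred, restricted to the narrow case of single ASCII characters (pred_sketch is a made-up name; a general String#pred that inverts succ across carries like "aa" -> "z" is considerably harder):

```ruby
class String
  # Only meaningful for single characters above "\x00".
  def pred_sketch
    raise ArgumentError, "single characters only" unless length == 1
    (ord - 1).chr
  end
end

p "b".pred_sketch               # => "a"
p ("a"..."c").last.pred_sketch  # => "b"
```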
I'm not sure about the speed (and initial tests don't seem incredibly fast), but the following might do what you need:
past = Time.local(2010, 1, 1, 0, 0, 0)
now = Time.now
range = past...now
range.to_a[-1]
Very basic testing (counting in my head) showed that it took about 4 seconds while the method you provided took about 5-6. Hope this helps.
Edit 1: Removed second solution as it was totally wrong.
I can't think of any way to achieve this that doesn't involve enumerating the range, unless, as already mentioned, you have other information about how the range was constructed and can therefore infer the desired value without enumeration. Of all the suggestions, I'd go with #max, since it seems the most expressive.
require 'benchmark'

N = 20
Benchmark.bm(30) do |r|
  past, now = Time.local(2010, 2, 1, 0, 0, 0), Time.now
  @range = past...now
  r.report("range.max") do
    N.times { last_in_range = @range.max }
  end
  r.report("explicit enumeration") do
    N.times { @range.each { |value| last_in_range = value } }
  end
  r.report("range.entries.last") do
    N.times { last_in_range = @range.entries.last }
  end
  r.report("range.to_a[-1]") do
    N.times { last_in_range = @range.to_a[-1] }
  end
end
user system total real
range.max 49.406000 1.515000 50.921000 ( 50.985000)
explicit enumeration 52.250000 1.719000 53.969000 ( 54.156000)
range.entries.last 53.422000 4.844000 58.266000 ( 58.390000)
range.to_a[-1] 49.187000 5.234000 54.421000 ( 54.500000)
I notice that the 3rd and 4th options have significantly increased system time. I expect that's related to the explicit creation of an array, which seems like a good reason to avoid them, even if they're not obviously more expensive in elapsed time.