More concise way of setting a counter, incrementing, and returning? - ruby

I'm working on a problem in which I want to compare two strings of equal length, char by char. For each index where the chars differ, I need to increment the counter by 1. Right now I have this:
def compute(strand1, strand2)
raise ArgumentError, "Sequences are different lengths" unless strand1.length == strand2.length
mutations = 0
strand1.chars.each_with_index { |nucleotide, index| mutations += 1 if nucleotide != strand2[index] }
mutations
end
I'm new to this language but this feels very non-Ruby to me. Is there a one-liner that can consolidate the process of setting a counter, incrementing it, and then returning it?
I was thinking along the lines of selecting all chars that don't match up and then getting the size of the resulting array. However, from what I could tell, there is no select_with_index method. I was also looking into inject but can't seem to figure out how I would apply it in this scenario.

Probably the simplest answer is to just count the differences:
strand1.chars.zip(strand2.chars).count{|a, b| a != b}
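Dropped into the original method, that one-liner gives something like this (a sketch reusing the question's names; the sample strands below are just illustrative):
def compute(strand1, strand2)
  raise ArgumentError, "Sequences are different lengths" unless strand1.length == strand2.length
  # zip pairs each char of strand1 with the char at the same index in strand2
  strand1.chars.zip(strand2.chars).count { |a, b| a != b }
end
compute("GAGCCTACTAACGGGAT", "CATCGTAATGACGGCCT") # => 7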

Here are a number of one-liners:
strand1.size.times.select { |index| strand1[index] != strand2[index] }.size
This is not the best one since it generates an intermediary array up to O(n) in size.
strand1.size.times.inject(0) { |sum, index| sum += 1 if strand1[index] != strand2[index]; sum }
Does not take extra memory but is a bit hard to read.
strand1.chars.each_with_index.count { |x, index| x != strand2[index] }
I'd go with that one. Kudos to user12341234 for mentioning count, which this is built upon.
Update
Benchmarking on my machine gives different results from what @CarySwoveland gets:
user system total real
mikej 4.080000 0.320000 4.400000 ( 4.408245)
user12341234 3.790000 0.210000 4.000000 ( 4.003349)
Nik 2.830000 0.170000 3.000000 ( 3.008533)
user system total real
mikej 3.590000 0.020000 3.610000 ( 3.618018)
user12341234 4.040000 0.140000 4.180000 ( 4.183357)
lazy user 4.250000 0.240000 4.490000 ( 4.497161)
Nik 2.790000 0.010000 2.800000 ( 2.808378)
This is not to point out my code as having better performance, but to mention that environment plays a great role in choosing any particular implementation approach.
I'm running this on a native 3.13.0-24-generic #47-Ubuntu SMP x64 kernel on a 12-core i7-3930K with enough RAM.

This is just an extended comment, so no upvotes please (downvotes are OK).
Gentlemen: start your engines!
Test problem
def random_string(n, selection)
n.times.each_with_object('') { |_,s| s << selection.sample }
end
n = 5_000_000
a = ('a'..'e').to_a
s1 = random_string(n,a)
s2 = random_string(n,a)
Benchmark code
require 'benchmark'
Benchmark.bm(12) do |x|
x.report('mikej') do
s1.chars.each_with_index.inject(0) {|c,(n,i)| n==s2[i] ? c : c+1}
end
x.report('user12341234') do
s1.chars.zip(s2.chars).count{|a,b| a != b }
end
x.report('lazy user') do
s1.chars.zip(s2.chars).lazy.count{|a,b| a != b }
end
x.report('Nik') do
s1.chars.each_with_index.count { |x,i| x != s2[i] }
end
end
Results
user system total real
mikej 6.220000 0.430000 6.650000 ( 6.651575)
user12341234 6.600000 0.900000 7.500000 ( 7.504499)
lazy user 7.460000 7.800000 15.260000 ( 15.255265)
Nik 6.140000 3.080000 9.220000 ( 9.225023)
user system total real
mikej 6.190000 0.470000 6.660000 ( 6.662569)
user12341234 6.720000 0.500000 7.220000 ( 7.223716)
lazy user 7.250000 7.110000 14.360000 ( 14.356845)
Nik 5.690000 0.920000 6.610000 ( 6.621889)
[Edit: I added a lazy version of 'user12341234'. As is generally the case with lazy versions of enumerators, there is a tradeoff between the amount of memory used and execution time. I did several runs. Here I report the results of two typical ones. There was very little variability for 'mikej' and 'user12341234', somewhat more for lazy user and quite a bit for Nik.]

inject would indeed be a way to do this as a one-liner, so you are on the right track:
strand1.chars.each_with_index.inject(0) { |count, (nucleotide, index)|
nucleotide == strand2[index] ? count : count + 1 }
We start with an initial value of 0; then, if the two letters are the same, we just return the current value of the accumulator, and if the letters are different we add 1.
Also, notice here that each_with_index is being called without a block. When called like this it returns an enumerator object. This lets us use one of the other enumerator methods (in this case inject) with the pairs of value and index returned by each_with_index, allowing the functionality of each_with_index and inject to be combined.
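A quick illustration of what that enumerator yields (toy strings, just for demonstration):
enum = "abc".chars.each_with_index
enum.to_a # => [["a", 0], ["b", 1], ["c", 2]]
# any Enumerable method can now consume the [value, index] pairs:
enum.count { |ch, i| ch != "abd"[i] } # => 1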
Edit: looking at it, user12341234's solution is a nice way of doing it. Good use of zip! I'd probably go with that.

Related

How expensive is generating a random number in Ruby?

Say you want to generate a random number between 1 and 1 billion:
rand(1..1_000_000_000)
Will Ruby create an array from that range every time you call this line of code?
Rubocop suggests this approach over rand(1_000_000_000)+1 but it seems there's potential for pain.
Ruby's docs say this:
# When +max+ is a Range, +rand+ returns a random number where
# range.member?(number) == true.
Here +max+ is the argument passed to rand, but the docs don't say how the number is generated. I'm also not sure whether calling .member? on a range is performant.
Any ideas?
I can use benchmark but still curious about the inner workings here.
No, Ruby will not create an array from that range, unless you explicitly call the .to_a method on the Range object. In fact, rand() doesn't work on arrays - .sample is the method to use for returning a random element from an array.
The Range class includes Enumerable so you get Enumerable's iteration methods without having to convert the range into an array. The lower and upper limits for a Range are (-Float::INFINITY..Float::INFINITY), although that will result in a Numerical argument out of domain error if you pass it into rand.
As for .member?, that method simply calls a C function called range_cover that calls another one called r_cover_p which checks if a value is between two numbers or strings.
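Conceptually, for an integer range rand only needs the two endpoints. The real implementation is in C, so the rand_in method below is just a hypothetical sketch of the idea:
def rand_in(range)
  # width of the range, accounting for ... (exclusive) vs .. (inclusive)
  span = range.last - range.first + (range.exclude_end? ? 0 : 1)
  range.first + rand(span) # pure arithmetic, no intermediate array
end
rand_in(1..1_000_000_000)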
To test the difference in speed between passing a range to rand and calling sample on an array, you can perform the following test:
require 'benchmark'
puts Benchmark.measure { rand(0..10_000_000) }
=> 0.000000 0.000000 0.000000 ( 0.000009)
puts Benchmark.measure { (0..10_000_000).to_a.sample }
=> 0.300000 0.030000 0.330000 ( 0.347752)
As you can see in the first example, passing a range as a parameter to rand is extremely fast.
By contrast, calling .to_a.sample on a range is rather slow, due to the array creation, which requires allocating memory for every element. The .sample method itself should be relatively fast, as it simply picks a random index into the array and returns the element there.
To check out the code for range have a look here.

Optimizing Array Memory Usage

I currently have a very large array of permutations, which is using a significant amount of RAM. This is the code I have, which SHOULD:
count all but the occurrences where more than one '1' exists or three '2's occur in a row.
arr = [*1..3].repeated_permutation(30).to_a
count = 0
arr.each do |x|
  if not x.join('').include? '222' and x.count(1) < 2
    count += 1
  end
end
print count
So basically this results in a 24,360-element array, each element of which has 30 elements.
I've tried to run it through Terminal but it literally ate through 14GB of RAM, and didn't move for 15 minutes, so I'm not sure whether the process froze while attempting to access more RAM or if it was still computing.
My question being: is there a faster way of doing this?
Thanks!
I am not sure what problem you are trying to solve. If your code is just an example of a more complex problem and you really need to check every single permutation programmatically, then you might want to experiment with lazy:
[*1..3].repeated_permutation(30).lazy.each do |permutation|
# your condition
end
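Applied to the stated condition, it would look roughly like this. (Strictly speaking, calling an Enumerable method such as count on the bare enumerator already avoids building the giant array; lazy pays off once you chain transformations like map or select first. Either way, this still performs 3**n iterations, so it is only practical for much smaller n.)
n = 10 # n = 30 means 3**30 permutations; enumerating them all is impractical
count = [*1..3].repeated_permutation(n).count do |x|
  !x.join.include?('222') && x.count(1) < 2
end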
Or you might want to make the nested iterator very explicit:
[1,2,3].each do |x1|
[1,2,3].each do |x2|
[1,2,3].each do |x3|
# ...
[1,2,3].each do |x30|
permutation = [x1,x2,x3, ... , x30]
# your condition
end
end
end
end
end
But it feels wrong to me to solve this kind of problem with Ruby enumerables at all. Let's have a look at your strings:
111111111111111111111111111111
111111111111111111111111111112
111111111111111111111111111113
111111111111111111111111111121
111111111111111111111111111122
111111111111111111111111111123
111111111111111111111111111131
...
333333333333333333333333333323
333333333333333333333333333331
333333333333333333333333333332
333333333333333333333333333333
I suggest just using enumerative combinatorics: look at the patterns and analyse (or count) how often your condition can be true. For example, there are 28 indices in your string at which a 222 substring could be placed, only 27 for the 2222 substring... If you place such a substring, how likely is it that there is no 1 in the other parts of the string?
I think your problem is a mathematics problem, not a programming problem.
NB This is an incomplete answer, but I think the idea might give a push toward the proper solution.
I can think of the following approach: let's represent each permutation as a value in base 3, padded with zeroes:
1 = 000..00001
2 = 000..00002
3 = 000..00010
4 = 000..00011
5 = 000..00012
...
Now consider that we restate the original task, treating zeroes as ones, ones as twos, and twos as threes. So far so good.
The whole list of permutations would be represented by:
(1..3**30-1).map { |e| e.to_s(3).rjust(30, '0') }
Now we are to apply your conditions:
def do_calc permutation_count
  (1..3**permutation_count-1).inject(0) do |memo, e|
    x = e.to_s(3).rjust(permutation_count, '0')
    !x.include?('111') && x.count('0') < 2 ? memo + 1 : memo
  end
end
Unfortunately, even for permutation_count == 20 it takes more than 5 minutes to calculate, so some additional steps are probably required. I will keep thinking about further optimization. For now, I hope this gives you a hint toward finding a good approach yourself.

How fast is Ruby's method for converting a range to an array?

(1..999).to_a
Is this method O(n)? I'm wondering if the conversion involves an implicit iteration so Ruby can write the values one-by-one into consecutive memory addresses.
The method is actually slightly worse than O(n). Not only does it do a naive iteration, but it doesn't check ahead of time what the size will be, so it has to repeatedly allocate more memory as it iterates. I've opened an issue for that aspect and it's been discussed a few times on the mailing list (and briefly added to ruby-core). The problem is that, like almost anything in Ruby, Range can be opened up and messed with, so Ruby can't really optimize the method. It can't even count on Range#size returning the correct result. Worse, some enumerables even have their size method delegate to to_a.
In general, it shouldn't be necessary to make this conversion, but if you really need array methods, you might be able to use Array#fill instead, which lets you populate a (potentially pre-allocated) array using values derived from its indices.
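For example, a minimal sketch of pre-allocating and filling instead of calling to_a on the range:
n = 999
a = Array.new(n)     # allocate the full capacity once, no repeated growth
a.fill { |i| i + 1 } # the block receives the index; same contents as (1..999).to_a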
Range.instance_methods(false).include? :to_a
# false
Range doesn't define to_a itself; it inherits it from the Enumerable mix-in, which builds the array by pushing each value on one at a time. That seems like it would be really inefficient, but I'll let the benchmark speak for itself:
require 'benchmark'
size = 100_000_000
Benchmark.bmbm do |r|
r.report("range") { (0...size).to_a }
r.report("fill") { a = Array.new(size); a.fill { |i| i } }
r.report("array") { Array.new(size) { |i| i } }
end
# Rehearsal -----------------------------------------
# range 4.530000 0.180000 4.710000 ( 4.716628)
# fill 5.810000 0.150000 5.960000 ( 5.966710)
# array 7.630000 0.250000 7.880000 ( 7.879940)
# ------------------------------- total: 18.550000sec
#
# user system total real
# range 4.540000 0.120000 4.660000 ( 4.660249)
# fill 5.980000 0.110000 6.090000 ( 6.089962)
# array 7.880000 0.110000 7.990000 ( 7.985818)
Isn't that weird? The range is actually the fastest by a significant margin. And manually filling an Array is somehow faster than just using the constructor.
But as is usually the case with Ruby, don't worry too much about this. For reasonably sized ranges the performance difference will be negligible.

Unexpected access performance differences between arrays and hashes

I have evaluated access times for a two-dimensional array, implemented as
an array of arrays
a hash of arrays
a hash with arrays as keys
My expectation was to see similar access times for all 3. I also expected the measurements to yield similar results for MRI and JRuby.
However, for a reason that I do not understand, on MRI accessing elements within an array of arrays or within a hash of arrays is an order of magnitude faster than accessing elements of a hash.
On JRuby, instead of being 10 times as expensive, hash access was about 50 times as expensive as with an array of arrays.
The results:
MRI (2.1.1):
user system total real
hash of arrays: 1.300000 0.000000 1.300000 ( 1.302235)
array of arrays: 0.890000 0.000000 0.890000 ( 0.896942)
flat hash: 16.830000 0.000000 16.830000 ( 16.861716)
JRuby (1.7.5):
user system total real
hash of arrays: 0.280000 0.000000 0.280000 ( 0.265000)
array of arrays: 0.250000 0.000000 0.250000 ( 0.182000)
flat hash: 77.450000 0.240000 77.690000 ( 75.156000)
Here are two of my benchmarks:
ary = (0...n).map { Array.new(n, 1) }
bm.report('array of arrays:') do
iterations.times do
(0...n).each { |x|
(0...n).each { |y|
v = ary[x][y]
}
}
end
end
hash = {}
(0...n).each { |x|
(0...n).each { |y|
hash[[x, y]] = 1
}
}
prepared_indices = (0...n).each_with_object([]) { |x, ind|
(0...n).each { |y|
ind << [x, y]
}
}
bm.report('flat hash:') do
iterations.times do
prepared_indices.each { |i|
v = hash[i]
}
end
end
All container elements are initialized with a numeric value and have the same total number of elements.
The arrays for accessing the hash are preinitialized in order to benchmark the element access only.
Here is the complete code
I have consulted this thread and this article but still have no clue about the unexpected performance differences.
Why are the results so different from my expectations? What am I missing?
Consider the memory layout of an array of arrays, say with dimensions 3x3... you've got something like this:
memory address usage/content
base [0][0]
base+sizeof(int) [0][1]
base+2*sizeof(int) [0][2]
base+3*sizeof(int) [1][0]
base+4*sizeof(int) [1][1]
...
Given an array of dimensions [M][N], all that's needed to access an element at indices [i][j] is to add the base memory address to the data element size times (i * N + j)... a tiny bit of simple arithmetic, and therefore extremely fast.
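For illustration, here is the same offset arithmetic sketched in Ruby, with a flat one-dimensional array standing in for the [M][N] layout (the names and helpers are hypothetical):
M = 3
N = 4
flat = Array.new(M * N, 0)
# offset = row index * row width + column index
set = ->(i, j, v) { flat[i * N + j] = v }
get = ->(i, j)    { flat[i * N + j] }
set.(1, 2, 42)
get.(1, 2) # => 42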
Hashes are far more complicated data structures and inherently slower. With a hash, you first need to take time to hash the key. The harder the hash function tries to scatter even similar keys randomly throughout its output range, the slower it tends to be; if it doesn't make that effort, you'll get more collisions in the hash table and slower performance there. Then the hash value needs to be mapped onto the current hash table size (usually using "%"), and then you need to compare keys to see whether you've found the hoped-for key, a colliding element, or an empty slot. It's a far more involved process than array indexing. You should probably do some background reading about hash functions and hash table implementations...
The reason hashes are so often useful is that the key doesn't need to be numeric (you can always work out some formula to generate a number from arbitrary key data) and need not be near-contiguous for memory efficiency (i.e. a hash table with say memory capacity for 5 integers could happily store keys 1, 1000 and 12398239 - whereas for an array keyed on those values there would be a lot of virtual address space wasted for all the indices in between, which have no data anyway, and anyway more data packed into a memory page means more cache hits).
Further - you should be careful with benchmarks. When you do clearly repetitive work with unchanging values, overwriting the same variable, an optimiser may elide it, and you may not be timing what you think you are. It's good to use some run-time inputs (e.g. storing different values in the containers) and accumulate some dependent result (e.g. summing the element accesses rather than overwriting them), then output that result so any lazy evaluation is forced to conclude. With things like JITs and VMs involved there can also be kinks in your benchmarks as compilation kicks in or branch prediction results are incorporated.
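A rough sketch of those precautions applied to the array benchmark above: feed it run-time values and accumulate a sum that gets printed, so the accesses cannot be skipped.
require 'benchmark'
n = 500
ary = (0...n).map { Array.new(n) { rand(100) } } # run-time inputs, not constants
sum = 0
Benchmark.bm(16) do |bm|
  bm.report('array of arrays:') do
    (0...n).each { |x| (0...n).each { |y| sum += ary[x][y] } }
  end
end
puts sum # using the result forces the work to actually happen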

Fastest way to get maximum value from an exclusive Range in ruby

Ok, so say you have a really big Range in ruby. I want to find a way to get the max value in the Range.
The Range is exclusive (defined with three dots), meaning that it does not include the end object in its results. It could be made up of Integer, String, Time, or really any object that responds to #<=> and #succ (which are the only requirements for the start/end objects of a Range).
Here's an example of an exclusive range:
past = Time.local(2010, 1, 1, 0, 0, 0)
now = Time.now
range = past...now
range.include?(now) # => false
Now I know I could just do something like this to get the max value:
range.max # => returns 1 second before "now" using Enumerable#max
But this will take a non-trivial amount of time to execute. I also know that I could subtract 1 second from whatever the end object is. However, the object may be something other than Time, and it may not even support #-. I would prefer to find an efficient general solution, but I am willing to combine special-case code with a fallback to a general solution (more on that later).
As mentioned above, using Range#last won't work either, because it's an exclusive range and does not include the last value in its results.
The fastest approach I could think of was this:
max = nil
range.each { |value| max = value }
# max now contains nil if the range is empty, or the max value
This is similar to what Enumerable#max does (which Range inherits), except that it exploits the fact that each value is going to be greater than the previous one, so we can skip using #<=> to compare each value with the previous (the way Range#max does), saving a tiny bit of time.
The other approach I was thinking about was to have special case code for common ruby types like Integer, String, Time, Date, DateTime, and then use the above code as a fallback. It'd be a bit ugly, but probably much more efficient when those object types are encountered because I could use subtraction from Range#last to get the max value without any iterating.
Can anyone think of a more efficient/faster approach than this?
The simplest solution that I can think of, which will work for inclusive as well as exclusive ranges:
range.max
Some other possible solutions:
range.entries.last
range.entries[-1]
These solutions are all O(n), and will be very slow for large ranges. The problem in principle is that range values in Ruby are enumerated using the succ method iteratively on all values, starting at the beginning. The elements do not have to implement a method to return the previous value (i.e. pred).
The fastest method would be to find the predecessor of the last item (an O(1) solution):
range.exclude_end? ? range.last.pred : range.last
This works only for ranges that have elements which implement pred. Later versions of Ruby implement pred for integers. You have to add the method yourself if it does not exist (essentially equivalent to special case code you suggested, but slightly simpler to implement).
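For instance, assuming one-second granularity (which matches how Range historically enumerated Time via succ), a hypothetical Time#pred could be patched in so the O(1) trick works for the question's example:
class Time
  def pred
    self - 1 # one second earlier; the granularity is an assumption
  end
end
past = Time.local(2010, 1, 1, 0, 0, 0)
range = past...Time.now
max = range.exclude_end? ? range.last.pred : range.last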
Some quick benchmarking shows that this last method is the fastest by many orders of magnitude for large ranges (in this case range = 1...1000000), because it is O(1):
user system total real
r.entries.last 11.760000 0.880000 12.640000 ( 12.963178)
r.entries[-1] 11.650000 0.800000 12.450000 ( 12.627440)
last = nil; r.each { |v| last = v } 20.750000 0.020000 20.770000 ( 20.910416)
r.max 17.590000 0.010000 17.600000 ( 17.633006)
r.exclude_end? ? r.last.pred : r.last 0.000000 0.000000 0.000000 ( 0.000062)
Benchmark code is here.
In the comments it is suggested to use range.last - (range.exclude_end? ? 1 : 0). It does work for dates without additional methods, but will never work for non-numeric ranges: String#- does not exist and makes no sense with integer arguments. String#pred, however, can be implemented.
I'm not sure about the speed (and initial tests don't seem incredibly fast), but the following might do what you need:
past = Time.local(2010, 1, 1, 0, 0, 0)
now = Time.now
range = past...now
range.to_a[-1]
Very basic testing (counting in my head) showed that it took about 4 seconds while the method you provided took about 5-6. Hope this helps.
Edit 1: Removed second solution as it was totally wrong.
I can't think of any way to achieve this that doesn't involve enumerating the range, unless, as already mentioned, you have other information about how the range was constructed and can therefore infer the desired value without enumeration. Of all the suggestions, I'd go with #max, since it seems to be the most expressive.
require 'benchmark'
N = 20
Benchmark.bm(30) do |r|
past, now = Time.local(2010, 2, 1, 0, 0, 0), Time.now
range = past...now
r.report("range.max") do
  N.times { last_in_range = range.max }
end
r.report("explicit enumeration") do
  N.times { range.each { |value| last_in_range = value } }
end
r.report("range.entries.last") do
  N.times { last_in_range = range.entries.last }
end
r.report("range.to_a[-1]") do
  N.times { last_in_range = range.to_a[-1] }
end
end
user system total real
range.max 49.406000 1.515000 50.921000 ( 50.985000)
explicit enumeration 52.250000 1.719000 53.969000 ( 54.156000)
range.entries.last 53.422000 4.844000 58.266000 ( 58.390000)
range.to_a[-1] 49.187000 5.234000 54.421000 ( 54.500000)
I notice that the 3rd and 4th options have significantly increased system time. I expect that's related to the explicit creation of an array, which seems like a good reason to avoid them, even if they're not obviously more expensive in elapsed time.
