I recently installed Ruby 2.0.0 and found that it now has a lazy method for the Enumerable mixin. From previous experience in functional languages, I know that this makes for more efficient code.
I did a benchmark (not sure if it is moot) of lazy versus eager and found that lazy was continually faster. Why is this? What makes lazy evaluation better for large input?
Benchmark code:
#!/usr/bin/env ruby
require 'benchmark'
num = 1000
arr = (1..50000).to_a
Benchmark.bm do |rep|
rep.report('lazy') { num.times do ; arr.lazy.map { |x| x * 2 }; end }
rep.report('eager') { num.times do ; arr.map { |x| x * 2}; end }
end
Benchmark report sample:
user system total real
lazy 0.000000 0.000000 0.000000 ( 0.009502)
eager 5.550000 0.480000 6.030000 ( 6.231269)
It's so lazy it's not even doing the work - probably because you're not actually using the results of the operation. Put a sleep() in there to confirm:
> Benchmark.bm do |rep|
rep.report('lazy') { num.times do ; arr.lazy.map { |x| sleep(5) }; end }
rep.report('notlazy') { 1.times do ; [0,1].map { |x| sleep(5) } ; end }
end
user system total real
lazy 0.010000 0.000000 0.010000 ( 0.007130)
notlazy 0.000000 0.000000 0.000000 ( 10.001788)
Related
I ran a benchmark to see whether memoizing attributes was faster than reading from a configuration hash. The code below is an example. Can anybody explain them?
The Test
require 'benchmark'
class MyClassWithStuff
DEFAULT_VALS = { one: '1', two: 2, three: 3, four: 4 }
def memoized_fetch
#value ||= DEFAULT_VALS[:one]
end
def straight_fetch
DEFAULT_VALS[:one]
end
end
TIMES = 10000
CALL_TIMES = 1000
Benchmark.bmbm do |test|
test.report("Memoized") do
TIMES.times do
instance = MyClassWithStuff.new
CALL_TIMES.times { |i| instance.memoized_fetch }
end
end
test.report("Fetched") do
TIMES.times do
instance = MyClassWithStuff.new
CALL_TIMES.times { |i| instance.straight_fetch }
end
end
end
Results
Rehearsal --------------------------------------------
Memoized 1.500000 0.010000 1.510000 ( 1.510230)
Fetched 1.330000 0.000000 1.330000 ( 1.342800)
----------------------------------- total: 2.840000sec
user system total real
Memoized 1.440000 0.000000 1.440000 ( 1.456937)
Fetched 1.260000 0.000000 1.260000 ( 1.269904)
I have learned two array sorting methods in Ruby:
array = ["one", "two", "three"]
array.sort.reverse!
or:
array = ["one", "two", "three"]
array.sort { |x,y| y<=>x }
And I am not able to differentiate between the two. Which method is better and how exactly are they different in execution?
Both lines do the same (create a new array, which is reverse sorted). The main argument is about readability and performance. array.sort.reverse! is more readable than array.sort{|x,y| y<=>x} - I think we can agree here.
For the performance part, I created a quick benchmark script, which gives the following on my system (ruby 1.9.3p392 [x86_64-linux]):
user system total real
array.sort.reverse 1.330000 0.000000 1.330000 ( 1.334667)
array.sort.reverse! 1.200000 0.000000 1.200000 ( 1.198232)
array.sort!.reverse! 1.200000 0.000000 1.200000 ( 1.199296)
array.sort{|x,y| y<=>x} 5.220000 0.000000 5.220000 ( 5.239487)
Run times are pretty constant for multiple executions of the benchmark script.
array.sort.reverse (with or without !) is way faster than array.sort{|x,y| y<=>x}. Thus, I recommend that.
Here is the script as a Reference:
#!/usr/bin/env ruby
require 'benchmark'
Benchmark.bm do|b|
master = (1..1_000_000).map(&:to_s).shuffle
a = master.dup
b.report("array.sort.reverse ") do
a.sort.reverse
end
a = master.dup
b.report("array.sort.reverse! ") do
a.sort.reverse!
end
a = master.dup
b.report("array.sort!.reverse! ") do
a.sort!.reverse!
end
a = master.dup
b.report("array.sort{|x,y| y<=>x} ") do
a.sort{|x,y| y<=>x}
end
end
There really is no difference here. Both methods return a new array.
For the purposes of this example, simpler is better. I would recommend array.sort.reverse because it is much more readable than the alternative. Passing blocks to methods like sort should be saved for arrays of more complex data structures and user-defined classes.
Edit: While destructive methods (anything ending in a !) are good for performance games, it was pointed out that they aren't required to return an updated array, or anything at all for that matter. It is important to keep this in mind because array.sort.reverse! could very likely return nil. If you wish to use a destructive method on a newly generated array, you should prefer calling .reverse! on a separate line instead of having a one-liner.
Example:
array = array.sort
array.reverse!
should be preferred to
array = array.sort.reverse!
Reverse! is Faster
There's often no substitute for benchmarking. While it probably makes no difference in shorter scripts, the #reverse! method is significantly faster than sorting using the "spaceship" operator. For example, on MRI Ruby 2.0, and given the following benchmark code:
require 'benchmark'
array = ["one", "two", "three"]
loops = 1_000_000
Benchmark.bmbm do |bm|
bm.report('reverse!') { loops.times {array.sort.reverse!} }
bm.report('spaceship') { loops.times {array.sort {|x,y| y<=>x} }}
end
the system reports that #reverse! is almost twice as fast as using the combined comparison operator.
user system total real
reverse! 0.340000 0.000000 0.340000 ( 0.344198)
spaceship 0.590000 0.010000 0.600000 ( 0.595747)
My advice: use whichever is more semantically meaningful in a given context, unless you're running in a tight loop.
With comparison as simple as your example, there is not much difference, but as the formula for comparison gets complicated, it is better to avoid using <=> with a block because the block you pass will be evaluated for each element of the array, causing redundancy. Consider this:
array.sort{|x, y| some_expensive_method(x) <=> some_expensive_method(y)}
In this case, some_expensive_method will be evaluated for each possible pair of element of array.
In your particular case, use of a block with <=> can be avoided with reverse.
array.sort_by{|x| some_expensive_method(x)}.reverse
This is called Schwartzian transform.
In playing with tessi's benchmarks on my machine, I've gotten some interesting results. I'm running ruby 2.0.0p195 [x86_64-darwin12.3.0], i.e., latest release of Ruby 2 on an OS X system. I used bmbm rather than bm from the Benchmark module. My timings are:
Rehearsal -------------------------------------------------------------
array.sort.reverse: 1.010000 0.000000 1.010000 ( 1.020397)
array.sort.reverse!: 0.810000 0.000000 0.810000 ( 0.808368)
array.sort!.reverse!: 0.800000 0.010000 0.810000 ( 0.809666)
array.sort{|x,y| y<=>x}: 0.300000 0.000000 0.300000 ( 0.291002)
array.sort!{|x,y| y<=>x}: 0.100000 0.000000 0.100000 ( 0.105345)
---------------------------------------------------- total: 3.030000sec
user system total real
array.sort.reverse: 0.210000 0.000000 0.210000 ( 0.208378)
array.sort.reverse!: 0.030000 0.000000 0.030000 ( 0.027746)
array.sort!.reverse!: 0.020000 0.000000 0.020000 ( 0.020082)
array.sort{|x,y| y<=>x}: 0.110000 0.000000 0.110000 ( 0.107065)
array.sort!{|x,y| y<=>x}: 0.110000 0.000000 0.110000 ( 0.105359)
First, note that in the Rehearsal phase that sort! using a comparison block comes in as the clear winner. Matz must have tuned the heck out of it in Ruby 2!
The other thing that I found exceedingly weird was how much improvement array.sort.reverse! and array.sort!.reverse! exhibited in the production pass. It was so extreme it made me wonder whether I had somehow screwed up and passed these already sorted data, so I added explicit checks for sorted or reverse-sorted data prior to performing each benchmark.
My variant of tessi's script follows:
#!/usr/bin/env ruby
require 'benchmark'
class Array
def sorted?
(1...length).each {|i| return false if self[i] < self[i-1] }
true
end
def reversed?
(1...length).each {|i| return false if self[i] > self[i-1] }
true
end
end
master = (1..1_000_000).map(&:to_s).shuffle
Benchmark.bmbm(25) do|b|
a = master.dup
puts "uh-oh!" if a.sorted?
puts "oh-uh!" if a.reversed?
b.report("array.sort.reverse:") { a.sort.reverse }
a = master.dup
puts "uh-oh!" if a.sorted?
puts "oh-uh!" if a.reversed?
b.report("array.sort.reverse!:") { a.sort.reverse! }
a = master.dup
puts "uh-oh!" if a.sorted?
puts "oh-uh!" if a.reversed?
b.report("array.sort!.reverse!:") { a.sort!.reverse! }
a = master.dup
puts "uh-oh!" if a.sorted?
puts "oh-uh!" if a.reversed?
b.report("array.sort{|x,y| y<=>x}:") { a.sort{|x,y| y<=>x} }
a = master.dup
puts "uh-oh!" if a.sorted?
puts "oh-uh!" if a.reversed?
b.report("array.sort!{|x,y| y<=>x}:") { a.sort!{|x,y| y<=>x} }
end
I have been working on a small Plagiarism detection engine which uses Idea from MOSS.
I need a Rolling Hash function, I am inspired from Rabin-Karp Algorithm.
Code I wrote -->
#!/usr/bin/env ruby
#Designing a rolling hash function.
#Inspired from the Rabin-Karp Algorithm
module Myth
module Hasher
#Defining a Hash Structure
#A hash is a integer value + line number where the word for this hash existed in the source file
Struct.new('Hash',:value,:line_number)
#For hashing a piece of text we ned two sets of parameters
#k-->For buildinf units of k grams hashes
#q-->Prime which lets calculations stay within range
def calc_hash(text_to_process,k,q)
text_length=text_to_process.length
radix=26
highorder=(radix**(text_length-1))%q
#Individual hashes for k-grams
text_hash=0
#The entire k-grams hashes list for the document
text_hash_string=""
#Preprocessing
for c in 0...k do
text_hash=(radix*text_hash+text_to_process[c].ord)%q
end
text_hash_string << text_hash.to_s << " "
loop=text_length-k
for c in 0..loop do
puts text_hash
text_hash=(radix*(text_hash-text_to_process[c].ord*highorder)+(text_hash[c+k].ord))%q
text_hash_string << text_hash_string << " "
end
end
end
end
I am running it with values -->
calc_hash(text,5,101) where text is a String input.
The code is very slow. Where am I going wrong?
Look at Ruby-Prof, a profiler for Ruby. Use gem install ruby-prof to install it.
Once you have some ideas where the code is lagging, you can use Ruby's Benchmark to try different algorithms to find the fastest.
Nose around on StackOverflow and you'll see lots of places where we'll use Benchmark to test various methods to see which is the fastest. You'll also get an idea of different ways to set up the tests.
For instance, looking at your code, I wasn't sure whether an append, <<, was better than concatenating using + or using string interpolation. Here's the code to test that and the results:
require 'benchmark'
include Benchmark
n = 1_000_000
bm(13) do |x|
x.report("interpolate") { n.times { foo = "foo"; bar = "bar"; "#{foo}#{bar}" } }
x.report("concatenate") { n.times { foo = "foo"; bar = "bar"; foo + bar } }
x.report("append") { n.times { foo = "foo"; bar = "bar"; foo << bar } }
end
ruby test.rb; ruby test.rb
user system total real
interpolate 1.090000 0.000000 1.090000 ( 1.093071)
concatenate 0.860000 0.010000 0.870000 ( 0.865982)
append 0.750000 0.000000 0.750000 ( 0.753016)
user system total real
interpolate 1.080000 0.000000 1.080000 ( 1.085537)
concatenate 0.870000 0.000000 0.870000 ( 0.864697)
append 0.750000 0.000000 0.750000 ( 0.750866)
I was wondering about the effects of using fixed versus variables when appending strings based on #Myth17's comment below:
require 'benchmark'
include Benchmark
n = 1_000_000
bm(13) do |x|
x.report("interpolate") { n.times { foo = "foo"; bar = "bar"; "#{foo}#{bar}" } }
x.report("concatenate") { n.times { foo = "foo"; bar = "bar"; foo + bar } }
x.report("append") { n.times { foo = "foo"; bar = "bar"; foo << bar } }
x.report("append2") { n.times { foo = "foo"; bar = "bar"; "foo" << bar } }
x.report("append3") { n.times { foo = "foo"; bar = "bar"; "foo" << "bar" } }
end
Resulting in:
ruby test.rb;ruby test.rb
user system total real
interpolate 1.330000 0.000000 1.330000 ( 1.326833)
concatenate 1.080000 0.000000 1.080000 ( 1.084989)
append 0.940000 0.010000 0.950000 ( 0.937635)
append2 1.160000 0.000000 1.160000 ( 1.165974)
append3 1.400000 0.000000 1.400000 ( 1.397616)
user system total real
interpolate 1.320000 0.000000 1.320000 ( 1.325286)
concatenate 1.100000 0.000000 1.100000 ( 1.090585)
append 0.930000 0.000000 0.930000 ( 0.936956)
append2 1.160000 0.000000 1.160000 ( 1.157424)
append3 1.390000 0.000000 1.390000 ( 1.392742)
The values are different than my previous test because the code is being run on my laptop.
Appending two variables is faster than when a fixed string is involved because there is overhead; Ruby has to create an intermediate variable and then append to it.
The big lesson here is we can make a more informed decision when we're writing code because we know what runs faster. At the same time, the differences are not very big, since most code isn't running 1,000,000 loops. Your mileage might vary.
I want to see if one string contains a keyword in a keyword list.
I have the following function:
def needfilter?(src)
["keyowrd_1","keyowrd_2","keyowrd_3","keyowrd_4","keyowrd_5"].each do |kw|
return true if src.include?(kw)
end
false
end
Can this code block be simplified in to one line sentence?
I know it can be simplified to:
def needfilter?(src)
!["keyowrd_1","keyowrd_2","keyowrd_3","keyowrd_4","keyowrd_5"].select{|c| src.include?(c)}.empty?
end
But this approach is not so efficient if the keyword array list is very long.
Looks like a nice use case for Enumerable#any? method:
def needfilter?(src)
["keyowrd_1","keyowrd_2","keyowrd_3","keyowrd_4","keyowrd_5"].any? do |kw|
src.include? kw
end
end
I was curious what's the fastest solution and I created a benchmark of all answers up to now.
I modified steenslag answer a bit. For tuning reasons I create the regexp only once not for each test.
require 'benchmark'
KEYWORDS = ["keyowrd_1","keyowrd_2","keyowrd_3","keyowrd_4","keyowrd_5"]
TESTSTRINGS = ['xx', 'xxx', "keyowrd_2"]
N = 10_000 #Number of Test loops
def needfilter_orig?(src)
["keyowrd_1","keyowrd_2","keyowrd_3","keyowrd_4","keyowrd_5"].each do |kw|
return true if src.include?(kw)
end
false
end
def needfilter_orig2?(src)
!["keyowrd_1","keyowrd_2","keyowrd_3","keyowrd_4","keyowrd_5"].select{|c| src.include?(c)}.empty?
end
def needfilter_any?(src)
["keyowrd_1","keyowrd_2","keyowrd_3","keyowrd_4","keyowrd_5"].any? do |kw|
src.include? kw
end
end
def needfilter_regexp?(src)
!!(src =~ Regexp.union(KEYWORDS))
end
def needfilter_regexp_init?(src)
!!(src =~ $KEYWORDS_regexp)
end
def needfilter_split?(src)
(src.split(/ /) & KEYWORDS).empty?
end
Benchmark.bmbm(10) {|b|
b.report('orig') { N.times { TESTSTRINGS.each{|src| needfilter_orig?(src)} } }
b.report('orig2') { N.times { TESTSTRINGS.each{|src| needfilter_orig2?(src) } } }
b.report('any') { N.times { TESTSTRINGS.each{|src| needfilter_any?(src) } } }
b.report('regexp') { N.times { TESTSTRINGS.each{|src| needfilter_regexp?(src) } } }
b.report('regexp_init') {
$KEYWORDS_regexp = Regexp.union(KEYWORDS) # Initialize once
N.times { TESTSTRINGS.each{|src| needfilter_regexp_init?(src) } }
}
b.report('split') { N.times { TESTSTRINGS.each{|src| needfilter_split?(src) } } }
} #Benchmark
Result:
Rehearsal -----------------------------------------------
orig 0.094000 0.000000 0.094000 ( 0.093750)
orig2 0.093000 0.000000 0.093000 ( 0.093750)
any 0.110000 0.000000 0.110000 ( 0.109375)
regexp 0.578000 0.000000 0.578000 ( 0.578125)
regexp_init 0.047000 0.000000 0.047000 ( 0.046875)
split 0.125000 0.000000 0.125000 ( 0.125000)
-------------------------------------- total: 1.047000sec
user system total real
orig 0.078000 0.000000 0.078000 ( 0.078125)
orig2 0.109000 0.000000 0.109000 ( 0.109375)
any 0.078000 0.000000 0.078000 ( 0.078125)
regexp 0.579000 0.000000 0.579000 ( 0.578125)
regexp_init 0.046000 0.000000 0.046000 ( 0.046875)
split 0.125000 0.000000 0.125000 ( 0.125000)
The solution with regular expressions is the fastest, if you create the regexp only once.
def need_filter?(src)
!!(src =~ /keyowrd_1|keyowrd_2|keyowrd_3|keyowrd_4|keyowrd_5/)
end
The =~ method returns a fixnum or nil. The double bang converts that to a boolean.
This is the way I'd do it:
def needfilter?(src)
keywords = Regexp.union("keyowrd_1","keyowrd_2","keyowrd_3","keyowrd_4","keyowrd_5")
!!(src =~ keywords)
end
This solution has:
No iteration
Single regexp using Regexp.union
Should be fast for even a large set of keywords. Note that hardcoding the keywords in the method is not ideal, but I'm assuming that was just for the sake of the example.
I think that
def need_filter?(src)
(src.split(/ /) & ["keyowrd_1","keyowrd_2","keyowrd_3","keyowrd_4","keyowrd_5"]).empty?
end
will work as you expect (as it's described in Array include any value from another array?) and will be faster than any? and include?.
I've often heard Ruby's inject method criticized as being "slow." As I rather like the function, and see equivalents in other languages, I'm curious if it's merely Ruby's implementation of the method that's slow, or if it is inherently a slow way to do things (e.g. should be avoided for non-small collections)?
inject is like fold, and can be very efficient in other languages, fold_left specifically, since it's tail-recursive.
It's mostly an implementation issue, but this gives you a good idea of the comparison:
$ ruby -v
ruby 1.8.7 (2008-08-11 patchlevel 72) [i486-linux]
$ ruby exp/each_v_inject.rb
Rehearsal -----------------------------------------------------
loop 0.000000 0.000000 0.000000 ( 0.000178)
fixnums each 0.790000 0.280000 1.070000 ( 1.078589)
fixnums each add 1.010000 0.290000 1.300000 ( 1.297733)
Enumerable#inject 1.900000 0.430000 2.330000 ( 2.330083)
-------------------------------------------- total: 4.700000sec
user system total real
loop 0.000000 0.000000 0.000000 ( 0.000178)
fixnums each 0.760000 0.300000 1.060000 ( 1.079252)
fixnums each add 1.030000 0.280000 1.310000 ( 1.305888)
Enumerable#inject 1.850000 0.490000 2.340000 ( 2.340341)
exp/each_v_inject.rb
require 'benchmark'
total = (ENV['TOTAL'] || 1_000).to_i
fixnums = Array.new(total) {|x| x}
Benchmark.bmbm do |x|
x.report("loop") do
total.times { }
end
x.report("fixnums each") do
total.times do |i|
fixnums.each {|x| x}
end
end
x.report("fixnums each add") do
total.times do |i|
v = 0
fixnums.each {|x| v += x}
end
end
x.report("Enumerable#inject") do
total.times do |i|
fixnums.inject(0) {|a,x| a + x }
end
end
end
So yes it is slow, but as improvements occur in the implementation it should become a non-issue. There is nothing inherent about WHAT it is doing that requires it to be slower.
each_with_object may be faster than inject, if you're mutating an existing object rather than creating a new object in each block.