I believe I have a good answer to this issue, but I wanted to make sure ruby-philes didn't have a much better way to do this.
Basically, given an input string, I would like to convert the string to an integer, where appropriate, or a float, where appropriate. Otherwise, just return the string.
I'll post my answer below, but I'd like to know if there is a better way out there.
Ex:
to_f_or_i_or_s("0523.49") #=> 523.49
to_f_or_i_or_s("0000029") #=> 29
to_f_or_i_or_s("kittens") #=> "kittens"
I would avoid using regex whenever possible in Ruby. It's notoriously slow.
def to_f_or_i_or_s(v)
((float = Float(v)) && (float % 1.0 == 0) ? float.to_i : float) rescue v
end
# Proof of Ruby's slow regex
def regex_float_detection(input)
input.match('\.')
end
def math_float_detection(input)
input % 1.0 == 0
end
n = 100_000
Benchmark.bm(30) do |x|
x.report("Regex") { n.times { regex_float_detection("1.1") } }
x.report("Math") { n.times { math_float_detection(1.1) } }
end
# user system total real
# Regex 0.180000 0.000000 0.180000 ( 0.181268)
# Math 0.050000 0.000000 0.050000 ( 0.048692)
A more comprehensive benchmark:
def wattsinabox(input)
input.match('\.').nil? ? Integer(input) : Float(input) rescue input.to_s
end
def jaredonline(input)
((float = Float(input)) && (float % 1.0 == 0) ? float.to_i : float) rescue input
end
def muistooshort(input)
case(input)
when /\A\s*[+-]?\d+\.\d+\z/
input.to_f
when /\A\s*[+-]?\d+(\.\d+)?[eE]\d+\z/
input.to_f
when /\A\s*[+-]?\d+\z/
input.to_i
else
input
end
end
n = 1_000_000
Benchmark.bm(30) do |x|
x.report("wattsinabox") { n.times { wattsinabox("1.1") } }
x.report("jaredonline") { n.times { jaredonline("1.1") } }
x.report("muistooshort") { n.times { muistooshort("1.1") } }
end
# user system total real
# wattsinabox 3.600000 0.020000 3.620000 ( 3.647055)
# jaredonline 1.400000 0.000000 1.400000 ( 1.413660)
# muistooshort 2.790000 0.010000 2.800000 ( 2.803939)
def to_f_or_i_or_s(v)
v.match('\.').nil? ? Integer(v) : Float(v) rescue v.to_s
end
A pile of regexes might be a good idea if you want to handle numbers in scientific notation (which String#to_f does):
def to_f_or_i_or_s(v)
case(v)
when /\A\s*[+-]?\d+\.\d+\z/
v.to_f
when /\A\s*[+-]?\d+(\.\d+)?[eE]\d+\z/
v.to_f
when /\A\s*[+-]?\d+\z/
v.to_i
else
v
end
end
You could mash both to_f cases into one regex if you wanted.
This will, of course, fail when fed '3,14159' in a locale that uses a comma as a decimal separator.
Depends on security requirements.
def to_f_or_i_or_s s
eval(s) rescue s
end
I used this method
def to_f_or_i_or_s(value)
return value if value[/[a-zA-Z]/]
i = value.to_i
f = value.to_f
i == f ? i : f
end
CSV has converters which do this.
require "csv"
strings = ["0523.49", "29","kittens"]
strings.each{|s|p s.parse_csv(converters: :numeric).first}
#523.49
#29
#"kittens"
However for some reason it converts "00029" to a float.
Related
Is there a way to simply check if a string value is a valid float value. Calling to_f on a string will convert it to 0.0 if it is not a numeric value. And using Float() raises an exception when it is passed an invalid float string which is closer to what I want, but I don't want to handle catching exceptions. What I really want is a method such as nan? which does exist in the Float class, but that doesn't help because a non-numeric string cannot be converted to a float without being changed to 0.0 (using to_f).
"a".to_f => 0.0
"a".to_f.nan? => false
Float("a") => ArgumentError: invalid value for Float(): "a"
Is there a simple solution for this or do I need to write code to check if a string is a valid float value?
Here's one way:
class String
def valid_float?
# The double negation turns this into an actual boolean true - if you're
# okay with "truthy" values (like 0.0), you can remove it.
!!Float(self) rescue false
end
end
"a".valid_float? #false
"2.4".valid_float? #true
If you want to avoid the monkey-patch of String, you could always make this a class method of some module you control, of course:
module MyUtils
def self.valid_float?(str)
!!Float(str) rescue false
end
end
MyUtils.valid_float?("a") #false
An interesting fact about the Ruby world is the existence of the Rubinius project, which implements Ruby and its standard library mostly in pure Ruby. As a result, they have a pure Ruby implementation of Kernel#Float, which looks like:
def Float(obj)
raise TypeError, "can't convert nil into Float" if obj.nil?
if obj.is_a?(String)
if obj !~ /^\s*[+-]?((\d+_?)*\d+(\.(\d+_?)*\d+)?|\.(\d+_?)*\d+)(\s*|([eE][+-]?(\d+_?)*\d+)\s*)$/
raise ArgumentError, "invalid value for Float(): #{obj.inspect}"
end
end
Type.coerce_to(obj, Float, :to_f)
end
This provides you with a regular expression that matches the internal work Ruby does when it runs Float(), but without the exception. So you could now do:
class String
def nan?
self !~ /^\s*[+-]?((\d+_?)*\d+(\.(\d+_?)*\d+)?|\.(\d+_?)*\d+)(\s*|([eE][+-]?(\d+_?)*\d+)\s*)$/
end
end
The nice thing about this solution is that since Rubinius runs, and passes RubySpec, you know this regex handles the edge-cases that Ruby itself handles, and you can call to_f on the String without any fear!
Ruby 2.6+
Ruby 2.6 added a new exception keyword argument to Float.
So, now to check whether a string contains a valid float is as simple as:
Float('22.241234', exception: false)
# => 22.241234
Float('abcd', exception: false)
# => nil
Here is a link to the docs.
# Edge Cases:
# numeric?"Infinity" => true is_numeric?"Infinity" => false
def numeric?(object)
true if Float(object) rescue false
end
#Possibly faster alternative
def is_numeric?(i)
i.to_i.to_s == i || i.to_f.to_s == i
end
I saw the unresolved discussion on cast+exceptions vs regex and I thought I would try to benchmark everything and produce an objective answer:
Here is the source for the best case and worst of each method attempted here:
require "benchmark"
n = 500000
def is_float?(fl)
!!Float(fl) rescue false
end
def is_float_reg(fl)
fl =~ /(^(\d+)(\.)?(\d+)?)|(^(\d+)?(\.)(\d+))/
end
class String
def to_float
Float self rescue (0.0 / 0.0)
end
end
Benchmark.bm(7) do |x|
x.report("Using cast best case") {
n.times do |i|
temp_fl = "#{i + 0.5}"
is_float?(temp_fl)
end
}
x.report("Using cast worst case") {
n.times do |i|
temp_fl = "asdf#{i + 0.5}"
is_float?(temp_fl)
end
}
x.report("Using cast2 best case") {
n.times do |i|
"#{i + 0.5}".to_float
end
}
x.report("Using cast2 worst case") {
n.times do |i|
"asdf#{i + 0.5}".to_float
end
}
x.report("Using regexp short") {
n.times do |i|
temp_fl = "#{i + 0.5}"
is_float_reg(temp_fl)
end
}
x.report("Using regexp long") {
n.times do |i|
temp_fl = "12340918234981234#{i + 0.5}"
is_float_reg(temp_fl)
end
}
x.report("Using regexp short fail") {
n.times do |i|
temp_fl = "asdf#{i + 0.5}"
is_float_reg(temp_fl)
end
}
x.report("Using regexp long fail") {
n.times do |i|
temp_fl = "12340918234981234#{i + 0.5}asdf"
is_float_reg(temp_fl)
end
}
end
With the following results with mri193:
user system total real
Using cast best case 0.608000 0.000000 0.608000 ( 0.615000)
Using cast worst case 5.647000 0.094000 5.741000 ( 5.745000)
Using cast2 best case 0.593000 0.000000 0.593000 ( 0.586000)
Using cast2 worst case 5.788000 0.047000 5.835000 ( 5.839000)
Using regexp short 0.951000 0.000000 0.951000 ( 0.952000)
Using regexp long 1.217000 0.000000 1.217000 ( 1.214000)
Using regexp short fail 1.201000 0.000000 1.201000 ( 1.202000)
Using regexp long fail 1.295000 0.000000 1.295000 ( 1.284000)
Since we are dealing with only Linear time algorithms I think we use empirical measurements to make generalizations. It is plain to see that regex is more consistent and will only fluctuate a bit based on the length of the string passed. The cast is clearly faster when there are no failure, and much slower when there are failures.
If we compare the success times we can see that the cast best case is about .3 seconds faster than regex best case. If we divide this by the amount of time in the case worst case we can estimate how many runs it would take to break even with exceptions slowing the cast down to match regex speeds. About 6 seconds dived by .3 gives us about 20. So if performance matters and you expect less than 1 in 20 of your test to fail then used cast+exceptions.
JRuby 1.7.4 has completely different results:
user system total real
Using cast best case 2.575000 0.000000 2.575000 ( 2.575000)
Using cast worst case 53.260000 0.000000 53.260000 ( 53.260000)
Using cast2 best case 2.375000 0.000000 2.375000 ( 2.375000)
Using cast2 worst case 53.822000 0.000000 53.822000 ( 53.822000)
Using regexp short 2.637000 0.000000 2.637000 ( 2.637000)
Using regexp long 3.395000 0.000000 3.395000 ( 3.396000)
Using regexp short fail 3.072000 0.000000 3.072000 ( 3.073000)
Using regexp long fail 3.375000 0.000000 3.375000 ( 3.374000)
Cast is only marginally faster in the best case (about 10%). Presuming this difference is appropriate for making generalizations(I do not think it is), then the break even point is somewhere between 200 and 250 runs with only 1 causing an exception.
So exceptions should only be used when truly exceptional things happens, this is decision for you and your codebase. When they aren't used code they are in can be simpler and faster.
If performance doesn't matter, you should probably just following whatever conventions you team or code base already has and ignore this whole answers.
def float?(string)
true if Float(string) rescue false
end
This supports 1.5, 5, 123.456, 1_000 but not 1 000, 1,000, etc. (e.g. same as String#to_f).
>> float?("1.2")
=> true
>> float?("1")
=> true
>> float?("1 000")
=> false
>> float?("abc")
=> false
>> float?("1_000")
=> true
Source: https://github.com/ruby/ruby/blob/trunk/object.c#L2934-L2959
Umm, if you don't want exceptions then perhaps:
def is_float?(fl)
fl =~ /(^(\d+)(\.)?(\d+)?)|(^(\d+)?(\.)(\d+))/
end
Since OP specifically asked for a solution without exceptions. Regexp based solution is marginally slow:
require "benchmark"
n = 500000
def is_float?(fl)
!!Float(fl) rescue false
end
def is_float_reg(fl)
fl =~ /(^(\d+)(\.)?(\d+)?)|(^(\d+)?(\.)(\d+))/
end
Benchmark.bm(7) do |x|
x.report("Using cast") {
n.times do |i|
temp_fl = "#{i + 0.5}"
is_float?(temp_fl)
end
}
x.report("using regexp") {
n.times do |i|
temp_fl = "#{i + 0.5}"
is_float_reg(temp_fl)
end
}
end
Results:
5286 snippets:master!? %
user system total real
Using cast 3.000000 0.000000 3.000000 ( 3.010926)
using regexp 5.020000 0.000000 5.020000 ( 5.021762)
Try this
def is_float(val)
fval = !!Float(val) rescue false
# if val is "1.50" for instance
# we need to lop off the trailing 0(s) with gsub else no match
return fval && Float(val).to_s == val.to_s.gsub(/0+$/,'') ? true:false
end
s = "1000"
is_float s
=> false
s = "1.5"
is_float s
=> true
s = "Bob"
is_float s
=> false
n = 1000
is_float n
=> false
n = 1.5
is_float n
=> true
I tried to add this as a comment but apparently there is no formatting in comments:
on the other hand, why not just use that as your conversion function, like
class String
def to_float
Float self rescue (0.0 / 0.0)
end
end
"a".to_float.nan? => true
which of course is what you didn't want to do in the first place. I guess the answer is, "you have to write your own function if you really don't want to use exception handling, but, why would you do that?"
I want find the length of a Fixnum, num, without converting it into a String.
In other words, how many digits are in num without calling the .to_s() method:
num.to_s.length
puts Math.log10(1234).to_i + 1 # => 4
You could add it to Fixnum like this:
class Fixnum
def num_digits
Math.log10(self).to_i + 1
end
end
puts 1234.num_digits # => 4
Ruby 2.4 has an Integer#digits method, which return an Array containing the digits.
num = 123456
num.digits
# => [6, 5, 4, 3, 2, 1]
num.digits.count
# => 6
EDIT:
To handle negative numbers (thanks #MatzFan), use the absolute value. Integer#abs
-123456.abs.digits
# => [6, 5, 4, 3, 2, 1]
Sidenote for Ruby 2.4+
I ran some benchmarks on the different solutions, and Math.log10(x).to_i + 1 is actually a lot faster than x.to_s.length. The comment from #Wayne Conrad is out of date. The new solution with digits.count is trailing far behind, especially with larger numbers:
with_10_digits = 2_040_240_420
print Benchmark.measure { 1_000_000.times { Math.log10(with_10_digits).to_i + 1 } }
# => 0.100000 0.000000 0.100000 ( 0.109846)
print Benchmark.measure { 1_000_000.times { with_10_digits.to_s.length } }
# => 0.360000 0.000000 0.360000 ( 0.362604)
print Benchmark.measure { 1_000_000.times { with_10_digits.digits.count } }
# => 0.690000 0.020000 0.710000 ( 0.717554)
with_42_digits = 750_325_442_042_020_572_057_420_745_037_450_237_570_322
print Benchmark.measure { 1_000_000.times { Math.log10(with_42_digits).to_i + 1 } }
# => 0.140000 0.000000 0.140000 ( 0.142757)
print Benchmark.measure { 1_000_000.times { with_42_digits.to_s.length } }
# => 1.180000 0.000000 1.180000 ( 1.186603)
print Benchmark.measure { 1_000_000.times { with_42_digits.digits.count } }
# => 8.480000 0.040000 8.520000 ( 8.577174)
Although the top-voted loop is nice, it isn't very Ruby and will be slow for large numbers, the .to_s is a built-in function and therefore will be much faster. ALMOST universally built-in functions will be far faster than constructed loops or iterators.
Another way:
def ndigits(n)
n=n.abs
(1..1.0/0).each { |i| return i if (n /= 10).zero? }
end
ndigits(1234) # => 4
ndigits(0) # => 1
ndigits(-123) # => 3
If you don't want to use regex, you can use this method:
def self.is_number(string_to_test)
is_number = false
# use to_f to handle float value and to_i for int
string_to_compare = string_to_test.to_i.to_s
string_to_compare_handle_end = string_to_test.to_i
# string has to be the same
if(string_to_compare == string_to_test)
is_number = true
end
# length for fixnum in ruby
size = Math.log10(string_to_compare_handle_end).to_i + 1
# size has to be the same
if(size != string_to_test.length)
is_number = false
end
is_number
end
You don't have to get fancy, you could do as simple as this.
def l(input)
output = 1
while input - (10**output) > 0
output += 1
end
return output
end
puts l(456)
It can be a solution to find out the length/count/size of a fixnum.
irb(main):004:0> x = 2021
=> 2021
irb(main):005:0> puts x.to_s.length
4
=> nil
irb(main):006:0>
What is the most efficient way to check if two hashes h1 and h2 have the same set of keys disregarding the order? Could it be made faster or more concise with close efficiency than the answer that I post?
Alright, let's break all rules of savoir vivre and portability. MRI's C API comes into play.
/* Name this file superhash.c. An appropriate Makefile is attached below. */
#include <ruby/ruby.h>
static int key_is_in_other(VALUE key, VALUE val, VALUE data) {
struct st_table *other = ((struct st_table**) data)[0];
if (st_lookup(other, key, 0)) {
return ST_CONTINUE;
} else {
int *failed = ((int**) data)[1];
*failed = 1;
return ST_STOP;
}
}
static VALUE hash_size(VALUE hash) {
if (!RHASH(hash)->ntbl)
return INT2FIX(0);
return INT2FIX(RHASH(hash)->ntbl->num_entries);
}
static VALUE same_keys(VALUE self, VALUE other) {
if (CLASS_OF(other) != rb_cHash)
rb_raise(rb_eArgError, "argument needs to be a hash");
if (hash_size(self) != hash_size(other))
return Qfalse;
if (!RHASH(other)->ntbl && !RHASH(other)->ntbl)
return Qtrue;
int failed = 0;
void *data[2] = { RHASH(other)->ntbl, &failed };
rb_hash_foreach(self, key_is_in_other, (VALUE) data);
return failed ? Qfalse : Qtrue;
}
void Init_superhash(void) {
rb_define_method(rb_cHash, "same_keys?", same_keys, 1);
}
Here's a Makefile.
CFLAGS=-std=c99 -O2 -Wall -fPIC $(shell pkg-config ruby-1.9 --cflags)
LDFLAGS=-Wl,-O1,--as-needed $(shell pkg-config ruby-1.9 --libs)
superhash.so: superhash.o
$(LINK.c) -shared $^ -o $#
An artificial, synthetic and simplistic benchmark shows what follows.
require 'superhash'
require 'benchmark'
n = 100_000
h1 = h2 = {a:5, b:8, c:1, d:9}
Benchmark.bm do |b|
# freemasonjson's state of the art.
b.report { n.times { h1.size == h2.size and h1.keys.all? { |key| !!h2[key] }}}
# This solution
b.report { n.times { h1.same_keys? h2} }
end
# user system total real
# 0.310000 0.000000 0.310000 ( 0.312249)
# 0.050000 0.000000 0.050000 ( 0.051807)
Combining freemasonjson's and sawa's ideas:
h1.size == h2.size and (h1.keys - h2.keys).empty?
Try:
# Check that both hash have the same number of entries first before anything
if h1.size == h2.size
# breaks from iteration and returns 'false' as soon as there is a mismatched key
# otherwise returns true
h1.keys.all?{ |key| !!h2[key] }
end
Enumerable#all?
worse case scenario, you'd only be iterating through the keys once.
Just for the sake of having at least a benchmark on this question...
require 'securerandom'
require 'benchmark'
a = {}
b = {}
# Use uuid to get a unique random key
(0..1_000).each do |i|
key = SecureRandom.uuid
a[key] = i
b[key] = i
end
Benchmark.bmbm do |x|
x.report("#-") do
1_000.times do
(a.keys - b.keys).empty? and (a.keys - b.keys).empty?
end
end
x.report("#&") do
1_000.times do
computed = a.keys & b.keys
computed.size == a.size
end
end
x.report("#all?") do
1_000.times do
a.keys.all?{ |key| !!b[key] }
end
end
x.report("#sort") do
1_000.times do
a_sorted = a.keys.sort
b_sorted = b.keys.sort
a == b
end
end
end
Results are:
Rehearsal -----------------------------------------
#- 1.000000 0.000000 1.000000 ( 1.001348)
#& 0.560000 0.000000 0.560000 ( 0.563523)
#all? 0.240000 0.000000 0.240000 ( 0.239058)
#sort 0.850000 0.010000 0.860000 ( 0.854839)
-------------------------------- total: 2.660000sec
user system total real
#- 0.980000 0.000000 0.980000 ( 0.976698)
#& 0.560000 0.000000 0.560000 ( 0.559592)
#all? 0.250000 0.000000 0.250000 ( 0.251128)
#sort 0.860000 0.000000 0.860000 ( 0.862857)
I have to agree with #akuhn that this would be a better benchmark if we had more information on the dataset you are using. But that being said, I believe this question really needed some hard fact.
It depends on your data.
There is no general case really. For example, generally retrieving the entire keyset at once is faster than checking inclusion of each key seperately. However, if in your dataset, the keysets differ more often than not, then a slower solution which fails faster might be faster. For example:
h1.size == h2.size and h1.keys.all?{|k|h2.include?(k)}
Another factor to consider is the size of your hashes. If they are big a solution with higher setup cost, like calling Set.new, might pay off, if however they are small, it won't:
h1.size == h2.size and Set.new(h1.keys) == Set.new(h2.keys)
And if you happen to compare the same immutable hashes over and over again, it would definitely pay off to cache the results.
Eventually only a benchmark will tell, but, to write a benchmark, we'd need to know more about your use case. For sure, testing a solution with synthetic data (as for example, randomly generated keys) will not be representative.
This is my try:
(h1.keys - h2.keys).empty? and (h2.keys - h1.keys).empty?
Here is my solution:
class Hash
# doesn't check recursively
def same_keys?(compare)
if compare.class == Hash
if self.size == compare.size
self.keys.all? {|s| compare.key?(s)}
else
return false
end
else
nil
end
end
end
a = c = { a: nil, b: "whatever1", c: 1.14, d: true }
b = { a: "foo", b: "whatever2", c: 2.14, "d": false }
d = { a: "bar", b: "whatever3", c: 3.14, }
puts a.same_keys?(b) # => true
puts a.same_keys?(c) # => true
puts a.same_keys?(d) # => false
puts a.same_keys?(false).inspect # => nil
puts a.same_keys?("jack").inspect # => nil
puts a.same_keys?({}).inspect # => false
I want to see if one string contains a keyword in a keyword list.
I have the following function:
def needfilter?(src)
["keyowrd_1","keyowrd_2","keyowrd_3","keyowrd_4","keyowrd_5"].each do |kw|
return true if src.include?(kw)
end
false
end
Can this code block be simplified in to one line sentence?
I know it can be simplified to:
def needfilter?(src)
!["keyowrd_1","keyowrd_2","keyowrd_3","keyowrd_4","keyowrd_5"].select{|c| src.include?(c)}.empty?
end
But this approach is not so efficient if the keyword array list is very long.
Looks like a nice use case for Enumerable#any? method:
def needfilter?(src)
["keyowrd_1","keyowrd_2","keyowrd_3","keyowrd_4","keyowrd_5"].any? do |kw|
src.include? kw
end
end
I was curious what's the fastest solution and I created a benchmark of all answers up to now.
I modified steenslag answer a bit. For tuning reasons I create the regexp only once not for each test.
require 'benchmark'
KEYWORDS = ["keyowrd_1","keyowrd_2","keyowrd_3","keyowrd_4","keyowrd_5"]
TESTSTRINGS = ['xx', 'xxx', "keyowrd_2"]
N = 10_000 #Number of Test loops
def needfilter_orig?(src)
["keyowrd_1","keyowrd_2","keyowrd_3","keyowrd_4","keyowrd_5"].each do |kw|
return true if src.include?(kw)
end
false
end
def needfilter_orig2?(src)
!["keyowrd_1","keyowrd_2","keyowrd_3","keyowrd_4","keyowrd_5"].select{|c| src.include?(c)}.empty?
end
def needfilter_any?(src)
["keyowrd_1","keyowrd_2","keyowrd_3","keyowrd_4","keyowrd_5"].any? do |kw|
src.include? kw
end
end
def needfilter_regexp?(src)
!!(src =~ Regexp.union(KEYWORDS))
end
def needfilter_regexp_init?(src)
!!(src =~ $KEYWORDS_regexp)
end
def needfilter_split?(src)
(src.split(/ /) & KEYWORDS).empty?
end
Benchmark.bmbm(10) {|b|
b.report('orig') { N.times { TESTSTRINGS.each{|src| needfilter_orig?(src)} } }
b.report('orig2') { N.times { TESTSTRINGS.each{|src| needfilter_orig2?(src) } } }
b.report('any') { N.times { TESTSTRINGS.each{|src| needfilter_any?(src) } } }
b.report('regexp') { N.times { TESTSTRINGS.each{|src| needfilter_regexp?(src) } } }
b.report('regexp_init') {
$KEYWORDS_regexp = Regexp.union(KEYWORDS) # Initialize once
N.times { TESTSTRINGS.each{|src| needfilter_regexp_init?(src) } }
}
b.report('split') { N.times { TESTSTRINGS.each{|src| needfilter_split?(src) } } }
} #Benchmark
Result:
Rehearsal -----------------------------------------------
orig 0.094000 0.000000 0.094000 ( 0.093750)
orig2 0.093000 0.000000 0.093000 ( 0.093750)
any 0.110000 0.000000 0.110000 ( 0.109375)
regexp 0.578000 0.000000 0.578000 ( 0.578125)
regexp_init 0.047000 0.000000 0.047000 ( 0.046875)
split 0.125000 0.000000 0.125000 ( 0.125000)
-------------------------------------- total: 1.047000sec
user system total real
orig 0.078000 0.000000 0.078000 ( 0.078125)
orig2 0.109000 0.000000 0.109000 ( 0.109375)
any 0.078000 0.000000 0.078000 ( 0.078125)
regexp 0.579000 0.000000 0.579000 ( 0.578125)
regexp_init 0.046000 0.000000 0.046000 ( 0.046875)
split 0.125000 0.000000 0.125000 ( 0.125000)
The solution with regular expressions is the fastest, if you create the regexp only once.
def need_filter?(src)
!!(src =~ /keyowrd_1|keyowrd_2|keyowrd_3|keyowrd_4|keyowrd_5/)
end
The =~ method returns a fixnum or nil. The double bang converts that to a boolean.
This is the way I'd do it:
def needfilter?(src)
keywords = Regexp.union("keyowrd_1","keyowrd_2","keyowrd_3","keyowrd_4","keyowrd_5")
!!(src =~ keywords)
end
This solution has:
No iteration
Single regexp using Regexp.union
Should be fast for even a large set of keywords. Note that hardcoding the keywords in the method is not ideal, but I'm assuming that was just for the sake of the example.
I think that
def need_filter?(src)
(src.split(/ /) & ["keyowrd_1","keyowrd_2","keyowrd_3","keyowrd_4","keyowrd_5"]).empty?
end
will work as you expect (as it's described in Array include any value from another array?) and will be faster than any? and include?.
What is the fastest way to read from STDIN a number of 1000000 characters (integers), and split it into an array of one character integers (not strings) ?
123456 > [1,2,3,4,5,6]
The quickest method I have found so far is as follows :-
gets.unpack("c*").map { |c| c-48}
Here are some results from benchmarking most of the provided solutions. These tests were run with a 100,000 digit file but with 10 reps for each test.
user system total real
each_char_full_array: 1.780000 0.010000 1.790000 ( 1.788893)
each_char_empty_array: 1.560000 0.010000 1.570000 ( 1.572162)
map_byte: 0.760000 0.010000 0.770000 ( 0.773848)
gets_scan 2.220000 0.030000 2.250000 ( 2.250076)
unpack: 0.510000 0.020000 0.530000 ( 0.529376)
And here is the code that produced them
#!/usr/bin/env ruby
require "benchmark"
MAX_ITERATIONS = 100000
FILE_NAME = "1_million_digits"
def build_test_file
File.open(FILE_NAME, "w") do |f|
MAX_ITERATIONS.times {|x| f.syswrite rand(10)}
end
end
def each_char_empty_array
STDIN.reopen(FILE_NAME)
a = []
STDIN.each_char do |c|
a << c.to_i
end
a
end
def each_char_full_array
STDIN.reopen(FILE_NAME)
a = Array.new(MAX_ITERATIONS)
idx = 0
STDIN.each_char do |c|
a[idx] = c.to_i
idx += 1
end
a
end
def map_byte()
STDIN.reopen(FILE_NAME)
a = STDIN.bytes.map { |c| c-48 }
a[-1] == -38 && a.pop
a
end
def gets_scan
STDIN.reopen(FILE_NAME)
gets.scan(/\d/).map(&:to_i)
end
def unpack
STDIN.reopen(FILE_NAME)
gets.unpack("c*").map { |c| c-48}
end
reps = 10
build_test_file
Benchmark.bm(10) do |x|
x.report("each_char_full_array: ") { reps.times {|y| each_char_full_array}}
x.report("each_char_empty_array:") { reps.times {|y| each_char_empty_array}}
x.report("map_byte: ") { reps.times {|y| map_byte}}
x.report("gets_scan ") { reps.times {|y| gets_scan}}
x.report("unpack: ") { reps.times {|y| unpack}}
end
This should be reasonably fast:
a = []
STDIN.each_char do |c|
a << c.to_i
end
although some rough benchmarking shows this hackish version is considerably faster:
a = STDIN.bytes.map { |c| c-48 }
scan(/\d/).map(&:to_i)
This will split any string into an array of integers, ignoring any non-numeric characters. If you want to grab user input from STDIN add gets:
gets.scan(/\d/).map(&:to_i)