I ran some benchmarks and was wondering why reversing a string and comparing it to itself seems to be faster than comparing individual characters.
Reversing a string is worst case O(n) and comparison O(n), resulting in O(n), unless comparing the same object, which should be O(1). But
str = "test"
str.reverse.object_id == str.object_id # => false
Is character comparison worst case O(1)? What am I missing?
Edit
I extracted and simplified for the question but here's the code I was running.
def reverse_compare(str)
str.reverse == str
end
def iterate_compare(str)
# can test with just str[0] != str[-1]
(str.length/2).times do |i|
return false if str[i] != str[-i-1]
end
end
require "benchmark"
n = 2000000
Benchmark.bm do |x|
str = "cabbbbbba" # best case single comparison
x.report("reverse_compare") { n.times do reverse_compare(str) ; a = "1"; end }
x.report("iterate_compare") { n.times do iterate_compare(str) ; a = "1"; end }
end
user system total real
reverse_compare 0.760000 0.000000 0.760000 ( 0.769463)
iterate_compare 1.840000 0.010000 1.850000 ( 1.855031)
There are two factors in favour of the reverse method:
Both String#reverse and String#== are written in pure C instead of Ruby. Their inner loops already use the fact that the length of the string is known, so there are no unnecessary boundary checks.
String#[], however, has to check the string boundaries on every call. The main loop is also written in Ruby, which makes it a bit slower. On top of that, each call creates a new one-character string object to return, which then has to be allocated and later garbage collected.
It looks like these factors outweigh the gain from the better algorithm when that algorithm is implemented in Ruby.
Also note that your test does not use random strings, just one specific string, and a very short one at that. If you try a larger one, the Ruby implementation may well turn out quicker.
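To check the point about string size yourself, here is a rough sketch of the same benchmark with a much longer input. The string long_str and the repetition count are my own choices, not from the question, and iterate_compare gets an explicit true at the end so both methods return booleans. I'm not quoting timings, since they depend on your Ruby version and machine.

```ruby
require "benchmark"

def reverse_compare(str)
  str.reverse == str
end

def iterate_compare(str)
  (str.length / 2).times do |i|
    return false if str[i] != str[-i - 1]
  end
  true
end

# Assumed test input: about one million characters, and the first and last
# characters differ, so iterate_compare can stop after a single comparison
# while reverse_compare still has to build the whole reversed copy.
long_str = "c" + "b" * 1_000_000 + "a"
n = 200

Benchmark.bm(16) do |x|
  x.report("reverse_compare") { n.times { reverse_compare(long_str) } }
  x.report("iterate_compare") { n.times { iterate_compare(long_str) } }
end
```

The early exit only helps when a mismatch shows up near the ends; for a long string that really is a palindrome, the Ruby loop has to walk half the string.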
To prove some of @SztupY's points: if you change your code a bit, like this:
def reverse_compare(str)
str.reverse == str
end
def iterate_compare(a, b)
a != b
end
require "benchmark"
n = 2_000_000
Benchmark.bm do |x|
str = "cabbbbbba" # best case single comparison
a = str[0]; b = str[-1]
x.report("reverse_compare") { n.times do reverse_compare(str) ; end }
x.report("iterate_compare") { n.times do iterate_compare(a, b) ; end }
end
You will get a somewhat different result:
#> user system total real
#> reverse_compare 0.359000 0.000000 0.359000 ( 0.361493)
#> iterate_compare 0.187000 0.000000 0.187000 ( 0.201590)
So you can guess that a good part of the time goes into creating the two string objects that String#[] returns.
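If the allocations are the cost, one way to avoid them is to compare bytes instead of one-character strings. This is only a sketch: iterate_compare_bytes is a name I made up, and it compares bytes, so it matches a character-wise check only for strings made of single-byte characters (plain ASCII, for example).

```ruby
# String#getbyte returns an Integer (or nil when out of range), so no
# temporary one-character String is created per comparison.
def iterate_compare_bytes(str)
  (str.bytesize / 2).times do |i|
    return false if str.getbyte(i) != str.getbyte(-i - 1)
  end
  true
end

puts iterate_compare_bytes("cabbbbbba")
puts iterate_compare_bytes("abba")
```

You would still want to benchmark it against the other two on your own Ruby before drawing conclusions.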
I have an array of strings, stringarr, and want to know the length of the longest string. I'm not interested in the string itself.
My first solution was
maxlen = stringarr.max_by(&:size).size
Works, but ugly: I have to mention size twice, which is error-prone, and the size of the longest string needs to be calculated twice. Well, no big deal with strings, but see below.
Another idea was
maxlen = stringarr.map(&:size).max
Cleaner from the viewpoint of readability, but needs to create a temporary array. Not good either.
Here is another attempt:
maxlen = stringarr.inject(0) {|memo,s| [memo.size,s.size].max}
Not exactly beautiful either....
I wonder if there is a better approach. My wish would be something like
maxlen = stringarr.max_of(&:size) # Sadly no max_of in Ruby core
This would be of interest in particular, when I have a more complicated code block. This certainly is not good style:
maxlen = measurement(stringarr.max_by {|s| measurement(s)})
Any suggestions?
I think your inject method is probably the cleanest, but it has an error in it. You memoize the size but then you call .size on the memoized number.
Try this instead:
maxlen = stringarr.inject(0) {|memo,s| [memo,s.size].max}
I compiled the suggestions from @msergeant and @sebastián-palma into a benchmark:
require 'benchmark'
N = 10_000
stringarr = Array.new( N, 'Lorem Ipsum dolor sit amet' )
Benchmark.bm do |x|
x.report { stringarr.map(&:size).max }
x.report { stringarr.max_by(&:size).size }
x.report { stringarr.sort_by(&:size)[-1].size }
x.report { stringarr.inject(0) { |memo,s| [memo,s.size].max } }
x.report { stringarr.inject(0) { |memo,s| memo > s.size ? memo : s.size } }
end
My winner is @msergeant's variant using the ternary operator, while the bare [memo,s.size].max is the slowest one. You may get different results in your environment and with different data in stringarr.
Based on the answers and comments I received, I decided to add a method to Enumerable for this purpose. I'm posting it here in case someone wants to benefit from it or, on the contrary, criticize it:
module Enumerable
# Similar to map(&block).max, but more efficient
#
# x.max_of {|s| f(s)} returns the largest value f(s), or nil if x is empty.
# x.max_of(seed) {|s| f(s)} returns the largest value f(s), provided that it
# is larger than seed, or returns seed, if no such
# element exists or x is empty.
def max_of(seed = nil)
enumerator = each
result = nil
begin
if seed
result = seed
else
result = yield(enumerator.next)
end
loop do
next_val = yield(enumerator.next)
result = next_val if next_val > result
end
rescue StopIteration
end
result
end
end
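For comparison, the same idea can be written with plain internal iteration (each with a block) instead of an external enumerator and StopIteration. This is just a sketch of an alternative, not a drop-in replacement; I named it max_of_sketch so it doesn't clash with the method above, and like max_of it is not part of Ruby core.

```ruby
module Enumerable
  # Hypothetical helper: largest block value, or seed (nil by default)
  # when the collection is empty or no value exceeds the seed.
  def max_of_sketch(seed = nil)
    result = seed
    each do |item|
      value = yield(item)
      result = value if result.nil? || value > result
    end
    result
  end
end

stringarr = %w[a bbb cc]
p stringarr.max_of_sketch(&:size)
p stringarr.max_of_sketch(10, &:size)
p [].max_of_sketch(&:size)
```

One caveat: because nil doubles as "no result yet", a block that can itself return nil would not work with this version.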
For a string like
s = "(string1) this is text (string2) that's separated (string3)"
I need a way to remove all the parentheses and the text inside them. However, if I use the following, it returns an empty string:
s.gsub(/\(.*\)/, "")
What can I use to get the following?
" this is text that's separated "
You could do the following:
s.gsub(/\(.*?\)/,'')
# => " this is text that's separated "
The ? in the regex is to make it "non-greedy". Without it, if:
s = "A: (string1) this is text (string2) that's separated (string3) B"
then
s.gsub(/\(.*\)/,'')
#=> "A: B"
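To see what each pattern actually matches, String#scan is handy. A small sketch using the string from the question (the variable names greedy and lazy are mine):

```ruby
s = "(string1) this is text (string2) that's separated (string3)"

# Greedy: .* runs to the last ")" in the string, so there is one big match.
greedy = s.scan(/\(.*\)/)

# Non-greedy: .*? stops at the first ")" it can, giving one match per group.
lazy = s.scan(/\(.*?\)/)

p greedy.size
p lazy
```

That single greedy match covers the whole string, which is why the original gsub returned an empty string.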
Edit: I ran the following benchmarks for various methods. You will see that there is one important take-away.
require 'benchmark'
n = 10_000_000
s = "(string1) this is text (string2) that's separated (string3)"
Benchmark.bm do |bm|
bm.report 'sawa' do
n.times { s.gsub(/\([^()]*\)/,'') }
end
bm.report 'cary' do
n.times { s.gsub(/\(.*?\)/,'') }
end
bm.report 'cary1' do
n.times { s.split(/\(.*?\)/).join }
end
bm.report 'sawa1' do
n.times { s.split(/\([^()]*\)/).join }
end
bm.report 'sawa!' do
n.times { s.gsub!(/\([^()]*\)/,'') }
end
bm.report 'user1179871' do
n.times { s.gsub(/\([\w\s]*\)/, '') }
end
end
user system total real
sawa 37.110000 0.070000 37.180000 ( 37.182598)
cary 37.000000 0.060000 37.060000 ( 37.066398)
cary1 35.960000 0.050000 36.010000 ( 36.009534)
sawa1 36.450000 0.050000 36.500000 ( 36.503711)
sawa! 7.630000 0.000000 7.630000 ( 7.632278)
user1179871 38.500000 0.150000 38.650000 ( 38.666955)
I ran the benchmark several times and the results varied a fair bit. In some cases sawa was slightly faster than cary.
[Edit: I added a modified version of @user1179871's method to the benchmark above, but did not change any of the text of my answer. The modification is described in a comment on @user1179871's answer. It looks to be slightly slower than sawa and cary, but that may not be the case, as the benchmark times vary from run to run, and I did a separate benchmark of the new method.]
Cary's answer is the simple way. This answer is the efficient way.
s.gsub(/\([^()]*\)/, "")
Keep in mind: non-greedy matching requires backtracking, and in general it is better not to use it if you can avoid it. But for such a simple task, Cary's answer is good enough.
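One more thing worth knowing: the two patterns are not interchangeable once parentheses are nested. A quick sketch with a made-up input (the string nested is my own example, not from the question):

```ruby
nested = "a (b (c) d) e"

# Negated class: only an innermost pair without parentheses inside can match.
by_class = nested.gsub(/\([^()]*\)/, "")

# Non-greedy: starts at the first "(" and stops at the first ")".
by_lazy = nested.gsub(/\(.*?\)/, "")

p by_class
p by_lazy
```

Neither result is a clean removal, so if your input can nest you need something other than a single pass with either regex.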
Try it
string.gsub(/\({1}\w*\){1}/, '')
I've taken a stab at writing a method, but my code isn't running and I'm not sure why.
str1 = "cored"
str2 = "coder"
def StringScramble(str1,str2)
numCombos = str1.length.downto(1).inject(:*)
arr = []
until arr.length == numCombos
shuffled = str1.split('').join
unless arr.include?(shuffled)
arr << shuffled
end
end
if arr.include?(str1)
return true
else
return false
end
end
Update: As @eugen pointed out in the comment, there's a much more efficient way:
str1.chars.sort == str2.chars.sort # => true
Original answer:
str1.chars.permutation.include?(str2.chars) # => true
Most efficient method?
Comparing sorted strings is certainly the easiest way, but can one do better if efficiency is paramount? Last month @raph posted a comment that suggested an approach that sounded pretty good to me. I intended to benchmark it against the standard test, but never got around to it. The purpose of my answer is to benchmark the suggested approach against the standard one.
The challenger
The idea is create a counting hash h for the characters in one of the strings, so that h['c'] equals the number of times 'c' appears in the string. One then goes through the characters of the second string. Suppose 'c' is one of those characters. Then false is returned by the method if h.key?('c') => false or h['c'] == 0 (which can also be written h['c'].to_i == 0, as nil.to_i => 0); otherwise, the next character of the second string is checked against the hash. Assuming the strings are of equal length, they are anagrams of each other if and only if false has not been returned after all the characters of the second string have been checked. Creating the hash for the shorter of the two strings probably offers a further improvement. Here is my code for the method:
def hcompare(s1,s2)
return false unless s1.size == s2.size
# set `ss` to the shorter string, `sl` to the other.
ss, sl = (s1.size < s2.size) ? [s1, s2] : [s2, s1]
# create hash `h` with letter counts for the shorter string:
h = ss.chars.each_with_object(Hash.new(0)) { |c,h| h[c] += 1}
#decrement counts in `h` for characters in `sl`
sl.each_char { |c| return false if h[c].to_i == 0; h[c] -= 1 }
true
end
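As an aside, on Ruby 2.7 or later the counting hash can be built with Enumerable#tally. This is a sketch of a variant, not something I benchmarked; tcompare is a name I picked to match the naming of the other methods, and unlike hcompare it has no early exit.

```ruby
# Builds a character-count hash for each string and compares the hashes.
# Requires Ruby 2.7+ for Enumerable#tally.
def tcompare(s1, s2)
  return false unless s1.size == s2.size
  s1.each_char.tally == s2.each_char.tally
end

p tcompare("cored", "coder")
p tcompare("cored", "codex")
```

Hash equality ignores insertion order, so the order in which characters were first seen does not matter.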
The incumbent
def scompare(s1,s2)
s1.chars.sort == s2.chars.sort
end
Helpers
methods = [:scompare, :hcompare]
def compute(m,s1,s2)
send(m,s1,s2)
end
def shuffle_chars(s)
s.chars.shuffle.join
end
Test data
reps = 20
ch = [*'b'..'z']
The benchmark
require 'benchmark'
[50000, 100000, 500000].each do |n|
t1 = Array.new(reps) { (Array.new(n) {ch.sample(1) }).join}
test_strings = { true=>t1.zip(t1.map {|s| shuffle_chars(s)})}
test_strings[false] = t1.zip(t1.map {|t| shuffle_chars((t[1..-1] << 'a'))})
puts "\nString length #{n}, #{reps} repetitions"
[true, false].each do |same|
puts "\nReturn #{same} "
Benchmark.bm(10) do |bm|
methods.each do |m|
bm.report m.to_s do
test_strings[same].each { |s1,s2| compute(m,s1,s2) }
end
end
end
end
end
Comparisons performed
I compared the two methods, scompare (uses sort) and hcompare (uses hash), performing the benchmark for three string lengths: 50,000, 100,000 and 500,000 characters. For each string length I created the first of two strings by selecting each character randomly from [*('b'..'z')]. I then created two strings to be compared with the first. One was merely a shuffling of the characters of the first string, so the methods would return true when those two strings are compared. In the second case I did the same, except I replaced a randomly selected character with 'a', so the methods would return false. These two cases are labelled true and false below.
Results
String length 50000, 20 repetitions
Return true
user system total real
scompare 0.620000 0.010000 0.630000 ( 0.625711)
hcompare 0.840000 0.010000 0.850000 ( 0.845548)
Return false
user system total real
scompare 0.530000 0.000000 0.530000 ( 0.532666)
hcompare 1.370000 0.000000 1.370000 ( 1.366293)
String length 100000, 20 repetitions
Return true
user system total real
scompare 1.420000 0.100000 1.520000 ( 1.516580)
hcompare 2.280000 0.010000 2.290000 ( 2.284189)
Return false
user system total real
scompare 1.020000 0.010000 1.030000 ( 1.034887)
hcompare 1.960000 0.000000 1.960000 ( 1.962655)
String length 500000, 20 repetitions
Return true
user system total real
scompare 10.310000 0.540000 10.850000 ( 10.850988)
hcompare 9.960000 0.180000 10.140000 ( 10.153366)
Return false
user system total real
scompare 8.120000 0.570000 8.690000 ( 8.687847)
hcompare 9.160000 0.030000 9.190000 ( 9.189997)
Conclusions
As you see, the method using the counting hash was superior to using sort in only one case, the true case with n = 500,000. Even there, the margin of victory was pretty small, much smaller than the relative differences in most of the other benchmark comparisons, where the standard method cruised to victory. While the hash counting method might have fared better with different tests, it seems that the conventional sorting method is hard to beat.
Was this answer of interest? I'm not sure, but since I had already done most of the work before seeing the results (which I expected would favour the counting hash), I decided to go ahead and put it out.
This method is taking over 7 seconds with 50 markets and 2,500 flows (~250,000 iterations). Why so slow?
def matrix
[:origin, :destination].collect do |location|
markets.collect do |market|
network.flows.collect { |flow| flow[location] == market ? 1 : 0 }
end
end.flatten
end
I know that the slowness comes from the comparison of one market to another market based on benchmarks that I've run.
Here are the relevant parts of the class that's being compared.
module FreightFlow
class Market
include ActiveAttr::Model
attribute :coordinates
def ==(value)
coordinates == value.coordinates
end
end
end
What's the best way to make this faster?
You are constructing 100 intermediate collections (2*50) comprising a total of 250,000 (2*50*2500) elements, and then flattening them at the end. I would try constructing the whole data structure in one pass. Make sure that markets and network.flows are stored in a hash or set. Maybe something like:
def matrix
network.flows.collect do |flow|
(markets.has_key? flow[:origin] or
markets.has_key? flow[:destination]) ? 1 : 0
end
end
This is a simple thing but it can help...
In your innermost loop you're doing:
network.flows.collect { |flow| flow[location] == market ? 1 : 0 }
Instead of using the ternary statement to convert to 1 or 0, use true and false Booleans instead:
network.flows.collect { |flow| flow[location] == market }
This isn't a big difference in speed, but over the course of that many nested loops it adds up.
In addition, it allows you to simplify your tests using the matrix being generated. Instead of having to compare to 1 or 0, you can simplify your conditional tests to if flow[location], if !flow[location] or unless flow[location], again speeding up your application a little bit for each test. If those are deeply nested in loops, which is very likely, that little bit can add up again.
Something that is important to do, when speed is important, is use Ruby's Benchmark class to test various ways of doing the same task. Then, instead of guessing, you KNOW what works. You'll find lots of questions on Stack Overflow where I've supplied an answer that consists of a benchmark showing the speed differences between various ways of doing something. Sometimes the differences are very big. For instance:
require 'benchmark'
puts `ruby -v`
def test1()
true
end
def test2(p1)
true
end
def test3(p1, p2)
true
end
N = 10_000_000
Benchmark.bm(5) do |b|
b.report('?:') { N.times { (1 == 1) ? 1 : 0 } }
b.report('==') { N.times { (1 == 1) } }
b.report('if') {
N.times {
if (1 == 1)
1
else
0
end
}
}
end
Benchmark.bm(5) do |b|
b.report('test1') { N.times { test1() } }
b.report('test2') { N.times { test2('foo') } }
b.report('test3') { N.times { test3('foo', 'bar') } }
b.report('test4') { N.times { true } }
end
And the results:
ruby 1.9.3p392 (2013-02-22 revision 39386) [x86_64-darwin10.8.0]
user system total real
?: 1.880000 0.000000 1.880000 ( 1.878676)
== 1.780000 0.000000 1.780000 ( 1.785718)
if 1.920000 0.000000 1.920000 ( 1.914225)
user system total real
test1 2.760000 0.000000 2.760000 ( 2.760861)
test2 4.800000 0.000000 4.800000 ( 4.808184)
test3 6.920000 0.000000 6.920000 ( 6.915318)
test4 1.640000 0.000000 1.640000 ( 1.637506)
ruby 2.0.0p0 (2013-02-24 revision 39474) [x86_64-darwin10.8.0]
user system total real
?: 2.280000 0.000000 2.280000 ( 2.285408)
== 2.090000 0.010000 2.100000 ( 2.087504)
if 2.350000 0.000000 2.350000 ( 2.363972)
user system total real
test1 2.900000 0.010000 2.910000 ( 2.899922)
test2 7.070000 0.010000 7.080000 ( 7.092513)
test3 11.010000 0.030000 11.040000 ( 11.033432)
test4 1.660000 0.000000 1.660000 ( 1.667247)
There are two different sets of tests. The first is looking to see what the differences are with simple conditional tests vs. using == without a ternary to get just the Booleans. The second is to test the effect of calling a method, a method with a single parameter, and with two parameters, vs. "inline-code" to find out the cost of the setup and tear-down when calling a method.
Modern C compilers do some amazing things when they analyze the code prior to emitting the assembly language to be compiled. We can fine-tune them to write for size or speed. When we go for speed, the program grows as the compiler looks for loops it can unroll and places it can "inline" code, to avoid making the CPU jump around and throwing away stuff that's in the cache.
Ruby is much higher up the language chain, but some of the same ideas still apply. We can write in a very DRY manner, and avoid repetition and use methods and classes to abstract our data, but the cost is reduced processing speed. The answer is to write your code intelligently and don't waste CPU time and unroll/inline where necessary to gain speed and other times be DRY to make your code more maintainable.
It's all a balancing act, and there's a time for writing both ways.
Memoizing the indexes of the markets within the flows was way faster than any other solution. Time reduced from ~30 seconds when the question was asked to 0.6 seconds.
First, I added a flow_index in the Network class. It stores the indexes of the flows that contain the markets.
def flow_index
@flow_index ||= begin
flow_index = {}
[:origin, :destination].each do |location|
flow_index[location] = {}
flows.each { |flow| flow_index[location][flow[location]] = [] }
flows.each_with_index { |flow, i| flow_index[location][flow[location]] << i }
end
flow_index
end
end
Then, I refactored the matrix method to use the flow index.
def matrix
base_row = network.flows.count.times.collect { 0 }
[:origin, :destination].collect do |location|
markets.collect do |market|
row = base_row.dup
network.flow_index[location][market].each do |i|
row[i] = 1
end
row
end
end.flatten
end
The base_row is created with all 0s and you just replace with 1s at the locations from the flow_index for that market.
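Here is the same idea on a tiny, made-up data set so you can see the index and the resulting rows. Everything in it is an assumption for illustration: flows are plain hashes and markets are symbols. In the real code the markets are objects used as hash keys, so they need consistent hash and eql? for the lookup to work.

```ruby
flows = [
  { origin: :a, destination: :b },
  { origin: :b, destination: :c },
  { origin: :a, destination: :c },
]
markets = [:a, :b, :c]

# Index: location => market => positions of the flows touching that market.
flow_index = {
  origin:      Hash.new { |h, k| h[k] = [] },
  destination: Hash.new { |h, k| h[k] = [] },
}
flows.each_with_index do |flow, i|
  [:origin, :destination].each { |loc| flow_index[loc][flow[loc]] << i }
end

# One row of zeros per (location, market); set 1 only at indexed positions.
matrix = [:origin, :destination].flat_map do |loc|
  markets.flat_map do |market|
    row = Array.new(flows.size, 0)
    flow_index[loc].fetch(market, []).each { |i| row[i] = 1 }
    row
  end
end

p flow_index[:origin][:a]
p matrix
```

The point is that each flow is visited once while building the index, instead of being compared against every market.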
I'm definitely a newbie to Ruby (and using 1.9.1), so any help is appreciated. Everything I've learned about Ruby has been from using Google. I'm trying to compare two arrays of hashes and, due to the sizes, it's taking way too long and flirts with running out of memory.
I have a Class (ParseCSV) with multiple methods (initialize, open, compare, strip, output).
The way I have it working right now is as follows (and this does pass the tests I've written, just using a much smaller data set):
file1 = ParseCSV.new("some_file")
file2 = ParseCSV.new("some_other_file")
file1.open # this reads the file contents into an Array of Hashes through the CSV library
file1.strip # this just removes extra hashes from each array index; normally there are fifty hashes in each array index, and this is done to help reduce memory consumption
file2.open
file2.compare(file1.storage) # @storage is the array of hashes from the open method
file2.output
Now what I’m struggling with is the compare method. Working on smaller data sets it’s not a big deal at all, works fast enough. However in this case I’m comparing about 400,000 records (all read into the array of hashes) against one that has about 450,000 records. I’m trying to speed this up. Also I can’t run the strip method on file2. Here is how I’m doing it now:
def compare(x)
# obviously just a verbose message
puts "Comparing and leaving behind non matching entries"
x.each do |row|
# @storage is the array of hashes
@storage.each_index do |y|
if row[@opts[:field]] == @storage[y][@opts[:field]]
@storage.delete_at(y)
end
end
end
end
Hopefully that makes sense. I know it’s going to be a slow process just because it has to iterate 400,000 rows 440,000 times each. But do you have any other ideas on how to speed it up and possibly reduce memory consumption?
Yikes, that'll be O(n^2) runtime. Nasty.
A better bet would be to use the built in Set class.
Code would look something like:
require 'set'
file1_content = load_file_content_into_array_here("some_file")
file2_content = load_file_content_into_array_here("some_other_file")
file1_set = Set.new(file1_content)
unique_elements = file1_set - file2_content
That assumes that the files themselves have unique content. Should work in the generic case, but may have quirks depending on what your data looks like and how you parse it, but as long as the lines can be compared with == it should help you out.
Using a set will be MUCH faster than doing a nested loop to iterate over the file content.
(and yes, I have actually done this to process files with ~2 million lines, so it should be able to handle your case - eventually. If you're doing heavy data munging, Ruby may not be the best choice of tool though)
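Applied to the array-of-hashes case in the question, the Set idea could look roughly like this. It is a sketch under assumptions: compare_by_key, storage, other and field are names I made up, and it returns a new array instead of deleting from @storage in place.

```ruby
require 'set'

# Collect the keys of the other file once, then keep only the rows whose
# key is not in that set. Set#include? is a hash lookup, so the whole
# thing is roughly one pass over each array.
def compare_by_key(storage, other, field)
  other_keys = other.map { |row| row[field] }.to_set
  storage.reject { |row| other_keys.include?(row[field]) }
end

storage = [{ id: 1, v: "a" }, { id: 2, v: "b" }, { id: 3, v: "c" }]
other   = [{ id: 2, v: "x" }, { id: 4, v: "y" }]

p compare_by_key(storage, other, :id)
```

Memory-wise this only adds one set of keys on top of the two arrays you already hold.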
Here's a script comparing two ways of doing it: Your original compare() and a new_compare(). The new_compare uses more of the built in Enumerable methods. Since they are implemented in C, they'll be faster.
I created a constant called Test::SIZE to try out the benchmarks with different hash sizes. Results at the bottom. The difference is huge.
require 'benchmark'
class Test
SIZE = 20000
attr_accessor :storage
def initialize
file1 = []
SIZE.times { |x| file1 << { :field => x, :foo => x } }
@storage = file1
@opts = {}
@opts[:field] = :field
end
def compare(x)
x.each do |row|
@storage.each_index do |y|
if row[@opts[:field]] == @storage[y][@opts[:field]]
@storage.delete_at(y)
end
end
end
end
def new_compare(other)
other_keys = other.map { |x| x[@opts[:field]] }
@storage.reject! { |s| other_keys.include? s[@opts[:field]] }
end
end
storage2 = []
# We'll make 10 of them match
10.times { |x| storage2 << { :field => x, :foo => x } }
# And the rest won't
(Test::SIZE-10).times { |x| storage2 << { :field => x+100000000, :foo => x} }
Benchmark.bm do |b|
b.report("original compare") do
t1 = Test.new
t1.compare(storage2)
end
end
Benchmark.bm do |b|
b.report("new compare") do
t1 = Test.new
t1.new_compare(storage2)
end
end
Results:
Test::SIZE = 500
user system total real
original compare 0.280000 0.000000 0.280000 ( 0.285366)
user system total real
new compare 0.020000 0.000000 0.020000 ( 0.020458)
Test::SIZE = 1000
user system total real
original compare 28.140000 0.110000 28.250000 ( 28.618907)
user system total real
new compare 1.930000 0.010000 1.940000 ( 1.956868)
Test::SIZE = 5000
ruby test.rb
user system total real
original compare113.100000 0.440000 113.540000 (115.041267)
user system total real
new compare 7.680000 0.020000 7.700000 ( 7.739120)
Test::SIZE = 10000
user system total real
original compare453.320000 1.760000 455.080000 (460.549246)
user system total real
new compare 30.840000 0.110000 30.950000 ( 31.226218)