Ruby how to remove repeated regex in string - ruby

For a string like
s = "(string1) this is text (string2) that's separated (string3)"
I need a way to remove all the parentheses and the text inside them. However, if I use the following, it returns an empty string:
s.gsub(/\(.*\)/, "")
What can I use to get the following?
" this is text that's separated "

You could do the following:
s.gsub(/\(.*?\)/,'')
# => " this is text that's separated "
The ? in the regex makes it "non-greedy". Without it, if:
s = "A: (string1) this is text (string2) that's separated (string3) B"
then
s.gsub(/\(.*\)/,'')
#=> "A: B"
Edit: I ran the following benchmarks for various methods. You will see that there is one important take-away.
require 'benchmark'

n = 10_000_000
s = "(string1) this is text (string2) that's separated (string3)"

Benchmark.bm do |bm|
  bm.report 'sawa' do
    n.times { s.gsub(/\([^()]*\)/,'') }
  end
  bm.report 'cary' do
    n.times { s.gsub(/\(.*?\)/,'') }
  end
  bm.report 'cary1' do
    n.times { s.split(/\(.*?\)/).join }
  end
  bm.report 'sawa1' do
    n.times { s.split(/\([^()]*\)/).join }
  end
  bm.report 'sawa!' do
    n.times { s.gsub!(/\([^()]*\)/,'') }
  end
  bm.report 'user1179871' do
    n.times { s.gsub(/\([\w\s]*\)/, '') }
  end
end
                   user     system      total        real
sawa          37.110000   0.070000  37.180000  ( 37.182598)
cary          37.000000   0.060000  37.060000  ( 37.066398)
cary1         35.960000   0.050000  36.010000  ( 36.009534)
sawa1         36.450000   0.050000  36.500000  ( 36.503711)
sawa!          7.630000   0.000000   7.630000  (  7.632278)
user1179871   38.500000   0.150000  38.650000  ( 38.666955)
I ran the benchmark several times and the results varied a fair bit. In some cases sawa was slightly faster than cary.
Edit: I added a modified version of @user1179871's method to the benchmark above, but did not change any of the text of my answer. The modification is described in a comment on @user1179871's answer. It looks to be slightly slower than sawa and cary, but that may not be the case, as the benchmark times vary from run to run, and I did a separate benchmark of the new method.
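One caveat worth noting about the sawa! timing: gsub! mutates its receiver, so after the first iteration s no longer contains any parenthesized text, and every subsequent call matches nothing (and returns nil). A minimal sketch of that effect:

```ruby
s = "(string1) this is text (string2) that's separated (string3)"

first  = s.gsub!(/\([^()]*\)/, '')  # mutates s in place, returns s
second = s.gsub!(/\([^()]*\)/, '')  # nothing left to replace, returns nil

# s has been permanently changed by the first call
s.include?("(")  # => false
```

So the 7-second figure mostly measures 9,999,999 no-op scans of an already-cleaned string, not the substitution itself.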

Cary's answer is the simple way. This answer is the efficient way.
s.gsub(/\([^()]*\)/, "")
To keep in mind: non-greedy matching requires backtracking, and in general it is better not to use it if you can avoid it. But for such a simple task, Cary's answer is good enough.
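To see the difference concretely: both patterns give the same result here, but the negated character class matches each character exactly once, with no backtracking (the input string below is made up for illustration):

```ruby
s = "(a) x (b) y (c)"

# Non-greedy: the engine tries the shortest match, backtracking as needed.
lazy    = s.gsub(/\(.*?\)/, '')
# Negated class: each character is tested once; no backtracking possible.
no_back = s.gsub(/\([^()]*\)/, '')

lazy == no_back  # => true, both yield " x  y "
```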

Try it
string.gsub(/\({1}\w*\){1}/, '')

Related

Why is `str.reverse == str` faster than `str[0] != str[-1]`?

I ran some benchmarks and was wondering why reversing a string and comparing it to itself seems to be faster than comparing individual characters.
Reversing a string is worst case O(n) and comparison O(n), resulting in O(n), unless comparing the same object, which should be O(1). But
str = "test"
str.reverse.object_id == str.object_id # => false
Is character comparison worst case O(1)? What am I missing?
Edit
I extracted and simplified for the question but here's the code I was running.
def reverse_compare(str)
  str.reverse == str
end

def iterate_compare(str)
  # can test with just str[0] != str[-1]
  (str.length/2).times do |i|
    return false if str[i] != str[-i-1]
  end
end
require "benchmark"

n = 2000000
Benchmark.bm do |x|
  str = "cabbbbbba" # best case single comparison
  x.report("reverse_compare") { n.times do reverse_compare(str) ; a = "1"; end }
  x.report("iterate_compare") { n.times do iterate_compare(str) ; a = "1"; end }
end
                      user     system      total        real
reverse_compare   0.760000   0.000000   0.760000  (  0.769463)
iterate_compare   1.840000   0.010000   1.850000  (  1.855031)
There are two factors in favour of the reverse method:
Both String#reverse and String#== are written in pure C instead of Ruby. Their inner loops already use the fact that the length of the string is known, so there are no unnecessary boundary checks.
String#[], however, needs to check the string boundaries on every call. The main loop is also written in Ruby, which makes it a bit slower, and each call creates a new one-character String object to return, which then has to be handled, garbage-collected, etc.
It looks like these two factors outweigh what you gain from the better algorithm, which is implemented in Ruby.
Also note that your test doesn't use a random string, but a specific and very short one. If you tried a larger one, it is possible that the Ruby implementation would be quicker.
To prove some of @SztupY's points: if you change your code a bit, like this:
def reverse_compare(str)
  str.reverse == str
end

def iterate_compare(a, b)
  a != b
end

require "benchmark"

n = 2_000_000
Benchmark.bm do |x|
  str = "cabbbbbba" # best case single comparison
  a = str[0]; b = str[-1]
  x.report("reverse_compare") { n.times do reverse_compare(str) ; end }
  x.report("iterate_compare") { n.times do iterate_compare(a, b) ; end }
end
You will get a bit different result:
#>                      user     system      total        real
#> reverse_compare  0.359000   0.000000   0.359000  (  0.361493)
#> iterate_compare  0.187000   0.000000   0.187000  (  0.201590)
So you can see that a good part of the time goes into creating the two one-character String objects in the first place.
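The object-creation cost is easy to see directly: every call to String#[] returns a brand-new one-character String, even for the same index:

```ruby
str = "cabbbbbba"

a = str[0]
b = str[0]

a == b       # => true  (same value)
a.equal?(b)  # => false (two distinct objects, each allocated fresh)
```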

How to create a method that checks if string1 can be rearranged to equal string2?

I've taken a stab at writing a method, but my code isn't working and I'm not sure why.
str1 = "cored"
str2 = "coder"
def StringScramble(str1,str2)
  numCombos = str1.length.downto(1).inject(:*)
  arr = []
  until arr.length == numCombos
    shuffled = str1.split('').shuffle.join
    unless arr.include?(shuffled)
      arr << shuffled
    end
  end
  if arr.include?(str2)
    return true
  else
    return false
  end
end
Update: As @eugen pointed out in the comments, there's a much more efficient way:
str1.chars.sort == str2.chars.sort # => true
Original answer:
str1.chars.permutation.include?(str2.chars) # => true
Most efficient method?
Comparing sorted strings is certainly the easiest way, but can one do better if efficiency is paramount? Last month @raph posted a comment that suggested an approach that sounded pretty good to me. I intended to benchmark it against the standard approach, but never got around to it. The purpose of my answer is to benchmark the suggested approach against the standard one.
The challenger
The idea is to create a counting hash h for the characters of one of the strings, so that h['c'] equals the number of times 'c' appears in that string. One then goes through the characters of the second string. Suppose 'c' is one of those characters. Then the method returns false if h.key?('c') == false or h['c'] == 0 (the two conditions can be written together as h['c'].to_i == 0, since nil.to_i == 0); otherwise, the next character of the second string is checked against the hash. Assuming the strings are of equal length, they are anagrams of each other if and only if false has not been returned after all the characters of the second string have been checked. Creating the hash for the shorter of the two strings probably offers a further improvement. Here is my code for the method:
def hcompare(s1,s2)
  return false unless s1.size == s2.size
  # set `ss` to the shorter string, `sl` to the other
  ss, sl = (s1.size < s2.size) ? [s1, s2] : [s2, s1]
  # create hash `h` with letter counts for the shorter string
  h = ss.chars.each_with_object(Hash.new(0)) { |c,h| h[c] += 1}
  # decrement counts in `h` for characters in `sl`
  sl.each_char { |c| return false if h[c].to_i == 0; h[c] -= 1 }
  true
end
The incumbent
def scompare(s1,s2)
  s1.chars.sort == s2.chars.sort
end
Helpers
methods = [:scompare, :hcompare]

def compute(m,s1,s2)
  send(m,s1,s2)
end

def shuffle_chars(s)
  s.chars.shuffle.join
end
Test data
reps = 20
ch = [*'b'..'z']
The benchmark
require 'benchmark'

[50000, 100000, 500000].each do |n|
  t1 = Array.new(reps) { (Array.new(n) { ch.sample(1) }).join }
  test_strings = { true=>t1.zip(t1.map {|s| shuffle_chars(s)})}
  test_strings[false] = t1.zip(t1.map {|t| shuffle_chars((t[1..-1] << 'a'))})
  puts "\nString length #{n}, #{reps} repetitions"
  [true, false].each do |same|
    puts "\nReturn #{same} "
    Benchmark.bm(10) do |bm|
      methods.each do |m|
        bm.report m.to_s do
          test_strings[same].each { |s1,s2| compute(m,s1,s2) }
        end
      end
    end
  end
end
Comparisons performed
I compared the two methods, scompare (uses sort) and hcompare (uses hash), performing the benchmark for three string lengths: 50,000, 100,000 and 500,000 characters. For each string length I created the first of two strings by selecting each character randomly from [*('b'..'z')]. I then created two strings to be compared with the first. One was merely a shuffling of the characters of the first string, so the methods would return true when those two strings are compared. In the second case I did the same, except I replaced a randomly selected character with 'a', so the methods would return false. These two cases are labelled true and false below.
Results
String length 50000, 20 repetitions

Return true
                 user     system      total        real
scompare     0.620000   0.010000   0.630000  (  0.625711)
hcompare     0.840000   0.010000   0.850000  (  0.845548)

Return false
                 user     system      total        real
scompare     0.530000   0.000000   0.530000  (  0.532666)
hcompare     1.370000   0.000000   1.370000  (  1.366293)

String length 100000, 20 repetitions

Return true
                 user     system      total        real
scompare     1.420000   0.100000   1.520000  (  1.516580)
hcompare     2.280000   0.010000   2.290000  (  2.284189)

Return false
                 user     system      total        real
scompare     1.020000   0.010000   1.030000  (  1.034887)
hcompare     1.960000   0.000000   1.960000  (  1.962655)

String length 500000, 20 repetitions

Return true
                 user     system      total        real
scompare    10.310000   0.540000  10.850000  ( 10.850988)
hcompare     9.960000   0.180000  10.140000  ( 10.153366)

Return false
                 user     system      total        real
scompare     8.120000   0.570000   8.690000  (  8.687847)
hcompare     9.160000   0.030000   9.190000  (  9.189997)
Conclusions
As you see, the method using the counting hash was superior to sort in only one case: true with n = 500,000. Even there, the margin of victory was pretty small, much smaller than the relative differences in most of the other benchmark comparisons, where the standard method cruised to victory. While the hash-counting method might have fared better with different tests, it seems that the conventional sorting method is hard to beat.
Was this answer of interest? I'm not sure, but since I had already done most of the work before seeing the results (which I expected would favour the counting hash), I decided to go ahead and put it out.

Why is this array building method so slow?

This method is taking over 7 seconds with 50 markets and 2,500 flows (~250,000 iterations). Why so slow?
def matrix
  [:origin, :destination].collect do |location|
    markets.collect do |market|
      network.flows.collect { |flow| flow[location] == market ? 1 : 0 }
    end
  end.flatten
end
I know that the slowness comes from the comparison of one market to another market based on benchmarks that I've run.
Here are the relevant parts of the class that's being compared.
module FreightFlow
  class Market
    include ActiveAttr::Model

    attribute :coordinates

    def ==(value)
      coordinates == value.coordinates
    end
  end
end
What's the best way to make this faster?
You are constructing 100 intermediate collections (2*50) comprising a total of 250,000 (2*50*2500) elements, and then flattening them at the end. I would try constructing the whole data structure in one pass. Make sure that markets and network.flows are stored in a hash or set. Maybe something like:
def matrix
  network.flows.collect do |flow|
    (markets.has_key? flow[:origin] or
     markets.has_key? flow[:destination]) ? 1 : 0
  end
end
This is a simple thing but it can help...
In your innermost loop you're doing:
network.flows.collect { |flow| flow[location] == market ? 1 : 0 }
Instead of using the ternary statement to convert to 1 or 0, use true and false Booleans instead:
network.flows.collect { |flow| flow[location] == market }
This isn't a big difference in speed, but over the course of that many nested loops it adds up.
In addition, it allows you to simplify your tests using the matrix being generated. Instead of having to compare to 1 or 0, you can simplify your conditional tests to if flow[location], if !flow[location] or unless flow[location], again speeding up your application a little bit for each test. If those are deeply nested in loops, which is very likely, that little bit can add up again.
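For instance, on a hypothetical miniature version of the data (names invented for illustration), the only change is dropping the ternary:

```ruby
markets = [:nyc, :chi]
flows   = [{ origin: :nyc, destination: :chi },
           { origin: :chi, destination: :nyc }]

# Ternary version: converts each comparison to an integer
ints  = flows.collect { |flow| flow[:origin] == :nyc ? 1 : 0 }  # => [1, 0]

# Boolean version: one comparison, no extra branch per element
bools = flows.collect { |flow| flow[:origin] == :nyc }          # => [true, false]
```

Downstream tests then become `if row[i]` instead of `if row[i] == 1`.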
Something that is important to do, when speed is important, is use Ruby's Benchmark class to test various ways of doing the same task. Then, instead of guessing, you KNOW what works. You'll find lots of questions on Stack Overflow where I've supplied an answer that consists of a benchmark showing the speed differences between various ways of doing something. Sometimes the differences are very big. For instance:
require 'benchmark'

puts `ruby -v`

def test1()
  true
end

def test2(p1)
  true
end

def test3(p1, p2)
  true
end

N = 10_000_000

Benchmark.bm(5) do |b|
  b.report('?:') { N.times { (1 == 1) ? 1 : 0 } }
  b.report('==') { N.times { (1 == 1) } }
  b.report('if') {
    N.times {
      if (1 == 1)
        1
      else
        0
      end
    }
  }
end

Benchmark.bm(5) do |b|
  b.report('test1') { N.times { test1() } }
  b.report('test2') { N.times { test2('foo') } }
  b.report('test3') { N.times { test3('foo', 'bar') } }
  b.report('test4') { N.times { true } }
end
And the results:
ruby 1.9.3p392 (2013-02-22 revision 39386) [x86_64-darwin10.8.0]
           user     system      total        real
?:     1.880000   0.000000   1.880000  (  1.878676)
==     1.780000   0.000000   1.780000  (  1.785718)
if     1.920000   0.000000   1.920000  (  1.914225)

           user     system      total        real
test1  2.760000   0.000000   2.760000  (  2.760861)
test2  4.800000   0.000000   4.800000  (  4.808184)
test3  6.920000   0.000000   6.920000  (  6.915318)
test4  1.640000   0.000000   1.640000  (  1.637506)

ruby 2.0.0p0 (2013-02-24 revision 39474) [x86_64-darwin10.8.0]
           user     system      total        real
?:      2.280000   0.000000   2.280000  (  2.285408)
==      2.090000   0.010000   2.100000  (  2.087504)
if      2.350000   0.000000   2.350000  (  2.363972)

           user     system      total        real
test1   2.900000   0.010000   2.910000  (  2.899922)
test2   7.070000   0.010000   7.080000  (  7.092513)
test3  11.010000   0.030000  11.040000  ( 11.033432)
test4   1.660000   0.000000   1.660000  (  1.667247)
There are two different sets of tests. The first is looking to see what the differences are with simple conditional tests vs. using == without a ternary to get just the Booleans. The second is to test the effect of calling a method, a method with a single parameter, and with two parameters, vs. "inline-code" to find out the cost of the setup and tear-down when calling a method.
Modern C compilers do some amazing things when they analyze the code prior to emitting the assembly language to be compiled. We can fine-tune them to write for size or speed. When we go for speed, the program grows as the compiler looks for loops it can unroll and places it can "inline" code, to avoid making the CPU jump around and throwing away stuff that's in the cache.
Ruby is much higher up the language chain, but some of the same ideas still apply. We can write in a very DRY manner, avoiding repetition and using methods and classes to abstract our data, but the cost is reduced processing speed. The answer is to write your code intelligently: don't waste CPU time, unroll or inline where necessary to gain speed, and stay DRY elsewhere to keep your code maintainable.
It's all a balancing act, and there's a time for writing both ways.
Memoizing the indexes of the markets within the flows was way faster than any other solution. Time reduced from ~30 seconds when the question was asked to 0.6 seconds.
First, I added a flow_index in the Network class. It stores the indexes of the flows that contain the markets.
def flow_index
  @flow_index ||= begin
    flow_index = {}
    [:origin, :destination].each do |location|
      flow_index[location] = {}
      flows.each { |flow| flow_index[location][flow[location]] = [] }
      flows.each_with_index { |flow, i| flow_index[location][flow[location]] << i }
    end
    flow_index
  end
end
Then, I refactored the matrix method to use the flow index.
def matrix
  base_row = network.flows.count.times.collect { 0 }
  [:origin, :destination].collect do |location|
    markets.collect do |market|
      row = base_row.dup
      network.flow_index[location][market].each do |i|
        row[i] = 1
      end
      row
    end
  end.flatten
end
The base_row is created with all 0s and you just replace with 1s at the locations from the flow_index for that market.
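On a hypothetical toy network (data invented for illustration), the index and the resulting rows look like this:

```ruby
flows = [{ origin: :a, destination: :b },
         { origin: :b, destination: :a },
         { origin: :a, destination: :c }]

# index: location -> market -> indexes of the flows that touch it
flow_index = {}
[:origin, :destination].each do |location|
  flow_index[location] = Hash.new { |h, k| h[k] = [] }
  flows.each_with_index { |flow, i| flow_index[location][flow[location]] << i }
end

flow_index[:origin][:a]  # => [0, 2]

# building one matrix row: start from all zeros, flip only the indexed slots
row = [0] * flows.size
flow_index[:origin][:a].each { |i| row[i] = 1 }
row  # => [1, 0, 1]
```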

What is the most performant way of processing this large text file?

When I read a text file into memory it brings my text in with '\n' at the end due to the new lines.
["Hello\n", "my\n", "name\n", "is\n", "John\n"]
Here is how I am reading the text file
array = File.readlines('text_file.txt')
I need to do a lot of processing on this text array, so I'm wondering, performance-wise, whether I should remove the "\n" when I first create the array, or later, when I do the processing on each element with regex.
I wrote some (admittedly bad) test code to remove the "\n"
array = []
File.open('text_file.txt', "r").each_line do |line|
  data = line.split(/\n/)
  array << data
end
array.flatten!
Is there a better way to do this if I should remove the "\n" when I first create the array?
If I wanted to read the file into a Set instead(for performance), is there a method similar to readlines to do that?
You need to run a benchmark test, using Ruby's built-in Benchmark to figure out what is your fastest choice.
However, from experience, I've found that "slurping" the file, i.e., reading it all in at once, is not any faster than using a loop with IO.foreach or File.foreach. This is because Ruby and the underlying OS do file buffering as the reads occur, allowing your loop to occur from memory, not directly from disk. foreach will not strip the line-terminators for you, like split would, so you'll need to add a chomp or chomp! if you want to mutate the line read in:
File.foreach('/path/to/file') do |li|
  puts li.chomp
end
or
File.foreach('/path/to/file') do |li|
  li.chomp!
  puts li
end
Also, slurping has the problem of not being scalable: you could end up trying to read a file bigger than memory, bringing your machine to its knees, while reading line by line never will.
Here's some performance numbers:
#!/usr/bin/env ruby

require 'benchmark'
require 'fileutils'

FILENAME = 'test.txt'
LOOPS = 1

puts "Ruby Version: #{RUBY_VERSION}"
puts "Filesize being read: #{File.size(FILENAME)}"
puts "Lines in file: #{`wc -l #{FILENAME}`.split.first}"

Benchmark.bm(20) do |x|
  x.report('read.split')           { LOOPS.times { File.read(FILENAME).split("\n") }}
  x.report('read.lines.chomp')     { LOOPS.times { File.read(FILENAME).lines.map(&:chomp) }}
  x.report('readlines.map.chomp1') { LOOPS.times { File.readlines(FILENAME).map(&:chomp) }}
  x.report('readlines.map.chomp2') { LOOPS.times { File.readlines(FILENAME).map{ |s| s.chomp } }}
  x.report('foreach.map.chomp1')   { LOOPS.times { File.foreach(FILENAME).map(&:chomp) }}
  x.report('foreach.map.chomp2')   { LOOPS.times { File.foreach(FILENAME).map{ |s| s.chomp } }}
end
And the results:
Ruby Version: 1.9.3
Filesize being read: 42026131
Lines in file: 465440
                          user     system      total        real
read.split            0.150000   0.060000   0.210000  (  0.213365)
read.lines.chomp      0.470000   0.070000   0.540000  (  0.541266)
readlines.map.chomp1  0.450000   0.090000   0.540000  (  0.535465)
readlines.map.chomp2  0.550000   0.060000   0.610000  (  0.616674)
foreach.map.chomp1    0.580000   0.060000   0.640000  (  0.641563)
foreach.map.chomp2    0.620000   0.050000   0.670000  (  0.662912)
On today's machines a 42 MB file can be read into RAM pretty safely. I have seen files a lot bigger than that which won't fit into the memory of some of our production hosts. While foreach is slower, it's also not going to bring a machine to its knees by sucking up all the memory when there isn't enough.
On Ruby 1.9.3, using the map(&:chomp) method, instead of the older form of map { |s| s.chomp }, is a lot faster. That wasn't true with older versions of Ruby, so caveat emptor.
Also, note that all the above processed the data in less than one second on my several years old Mac Pro. All in all I'd say that worrying about the load speed is premature optimization, and the real problem will be what is done after the data is loaded.
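As for the side question about reading straight into a Set: there is no readlines-style shortcut, but you can build one line at a time with each_line/foreach, chomping as you insert; duplicates collapse automatically. A minimal sketch, using StringIO as a stand-in for the real file so it runs on its own:

```ruby
require 'set'
require 'stringio'

# Stand-in for File.foreach('text_file.txt'); a real File responds
# to each_line the same way.
io = StringIO.new("Hello\nmy\nname\nis\nJohn\n")

lines = Set.new
io.each_line { |line| lines << line.chomp }

lines.include?("John")  # => true
```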
I'd use String#chomp:
lines = open('text_file.txt').lines.map(&:chomp)
If you want to get rid of the ending newline character you can use either String#chomp or String#rstrip. My preferred method would be chomp.
So you can easily do something like:
lines.map! { |line| line.chomp }
# or
lines.map! { |line| line.rstrip }
mvelez#argo:~$ cat test.txt
Hello
my
name
is
John
One liner:
arr = File.open("test.txt",'r').read.split
Decomposing this in irb
irb(main):002:0> f = File.open("test.txt",'r')
=> #<File:test.txt>
irb(main):003:0> file_contents = f.read
=> "Hello\nmy\nname\nis\nJohn\n\n"
irb(main):004:0> file_contents.split
=> ["Hello", "my", "name", "is", "John"]
I'd prefer using strip over split in these cases, and doing it right after handling the line for the first time. Using split after readline is overkill, IMO. So the code snippet would be
array = []
File.open('text_file.txt', "r").each_line do |line|
  array << line.strip
end

Infinite yields from an iterator

I'm trying to learn some ruby.
Imagine I'm looping and doing a long running process, and in this process I want to get a spinner for as long as necessary.
So I could do:
a=['|','/','-','\\']
aNow=0
# ... skip setup a big loop
print a[aNow]
aNow += 1
aNow = 0 if aNow == a.length
# ... do next step of process
print "\b"
But I thought it'd be cleaner to do:
def spinChar
  a = ['|','/','-','\\']
  a.cycle{ |x| yield x }
end
# ... skip setup a big loop
print spinChar
# ... do next step of process
print "\b"
Of course the spinChar call wants a block. If I give it a block it'll hang indefinitely.
How can I get just the next yield of this block?
Ruby's yield does not work in the way your example would like. But this might be a good place for a closure:
def spinner()
  state = ['|','/','-','\\']
  return proc { state.push(state.shift)[0] }
end
spin = spinner
# start working
print spin.call
# more work
print spin.call
# etc...
In practice I think this solution might be too "clever" for its own good, but understanding the idea of Procs could be useful anyhow.
I like all these suggestions, but I found the Generator in the standard library, and I think it's more along the lines of what I wanted to do:
spinChar = Generator.new{ |g|
  ['|','/','-','\\'].cycle{ |x|
    g.yield x
  }
}
# then
spinChar.next
# will indefinitely output the next character.
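A note for anyone reading this later: Generator was removed from the standard library (it shipped with Ruby 1.8), and on modern Rubies the same shape is written with Enumerator.new:

```ruby
# Enumerator.new runs the block lazily; each call to #next
# resumes it just long enough to yield one value.
spin_char = Enumerator.new do |y|
  ['|','/','-','\\'].cycle { |x| y << x }
end

spin_char.next  # => "|"
spin_char.next  # => "/"
```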
Plain array index increments with modulus on a frozen array seems to be fastest.
Vlad's thread is nifty but not exactly what I wanted. And in the spinner class the one-line increment would be nicer if Ruby supported i++, as in GLYPHS[@i++ % GLYPHS.length].
Max's spinner closure with push shift seems a little intensive to me, but the resulting syntax is almost exactly like this Generator. At least I think that's a closure with proc in there.
Chuck's with_spinner is actually pretty close to what I wanted, but why break if you don't have to with a Generator as above.
Vadim, thanks for pointing out the generator would be slow.
Here's a test of 50,000 spins:
                     user     system      total        real
array index      0.050000   0.000000   0.050000  (  0.055520)
spinner class    0.100000   0.010000   0.110000  (  0.105594)
spinner closure  0.080000   0.030000   0.110000  (  0.116068)
with_spinner     0.070000   0.030000   0.100000  (  0.095915)
generator        6.710000   0.320000   7.030000  (  7.304569)
I think you were on the right track with cycle. How about something like this:
1.8.7 :001 > spinner = ['|','/','-','\\'].cycle
=> #<Enumerable::Enumerator:0x7f111c165790>
1.8.7 :002 > spinner.next
=> "|"
1.8.7 :003 > spinner.next
=> "/"
1.8.7 :004 > spinner.next
=> "-"
1.8.7 :005 > spinner.next
=> "\\"
1.8.7 :006 > spinner.next
=> "|"
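Dropped into the original loop, the Enumerator reads naturally (the work step is a placeholder, and the loop count is arbitrary):

```ruby
spinner = ['|','/','-','\\'].cycle  # an Enumerator that never runs out

5.times do
  print spinner.next
  # ... do next step of process ...
  print "\b"
end
```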
I don't think you quite understand what yield does in Ruby. It doesn't return a value from a block — it passes a value to the block you've passed to the enclosing method.
I think you want something more like this:
def with_spinner
  a = ['|','/','-','\\']
  a.cycle do |x|
    print x
    $stdout.flush # needed because standard out is usually buffered
    yield # this will call the do-block you pass to with_spinner
  end
end
with_spinner do
  # process here
  # break when done
end
Once upon a time, I wrote an array. But it's not just an array, it's an array that has a pointer, so you can call next foreverrr!
http://gist.github.com/55955
Pair this class with a simple iterator or loop and you are golden.
a = Collection.new(:a, :b, :c)

1000.times do |i|
  puts a.current
  a.next
end
Your code is a bit inside-out, if you'll pardon me saying so. :)
Why not:
class Spinner
  GLYPHS = ['|','/','-','\\']
  def budge
    print "#{GLYPHS[@idx = ((@idx || 0) + 1) % GLYPHS.length]}\b"
  end
end
spinner = Spinner.new
spinner.budge
# do something
spinner.budge
spinner.budge
# do something else
spinner.budge
Now, if you want something like:
with_spinner do
# do my big task here
end
...then you'd have to use multi-threading:
def with_spinner
  t = Thread.new do
    ['|','/','-','\\'].cycle { |c| print "#{c}\b" ; sleep(1) }
  end
  yield
  Thread.kill(t) # nasty but effective
end
hehe, the answer above mine is all dirty.
a=['|','/','-','\\']
a << a
a.each {|item| puts item}
