Looking to clean up a small ruby script - ruby

I'm looking for a more idiomatic way to write the following little Ruby script.
File.open("channels.xml").each do |line|
if line.match('(mms:\/\/{1}[a-zA-Z\.\d\/\w-]+)')
puts line.match('(mms:\/\/{1}[a-zA-Z\.\d\/\w-]+)')
end
end
Thanks in advance for any suggestions.

The original:
File.open("channels.xml").each do |line|
if line.match('(mms:\/\/{1}[a-zA-Z\.\d\/\w-]+)')
puts line.match('(mms:\/\/{1}[a-zA-Z\.\d\/\w-]+)')
end
end
can be changed into this:
m = nil
open("channels.xml").each do |line|
puts m if m = line.match(%r|(mms://{1}[\w\./-]+)|)
end
File.open can be changed to just open.
if XYZ
puts XYZ
end
can be changed to puts x if x = XYZ, as long as x has already been assigned somewhere in the enclosing scope before the statement (hence the m = nil above).
The Regexp '(mms:\/\/{1}[a-zA-Z\.\d\/\w-]+)' can be refactored a little bit. Using the %rXX notation, you can create regular expressions without the need for so many backslashes, where the two X's are any matching pair of delimiters, such as ( and ) or, in the example above, | and |.
This character class [a-zA-Z\.\d\/\w-] (read: A to Z, case insensitive, the period character, 0 to 9, a forward slash, any word character, or a dash) is a little redundant. \w denotes "word characters", i.e. A-Za-z0-9 and underscore. Since you specify \w as a positive match, A-Za-z and \d are redundant.
Using those 2 cleanups, the Regexp can be changed into this: %r|(mms://{1}[\w\./-]+)|
If you'd like to avoid the weird m = nil scoping sorcery, this will also work, but is less idiomatic:
open("channels.xml").each do |line|
m = line.match(%r|(mms://{1}[\w\./-]+)|) and puts m
end
or the longer, but more readable version:
open("channels.xml").each do |line|
if m = line.match(%r|(mms://{1}[\w\./-]+)|)
puts m
end
end
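To check that the simplified character class still matches what the original pattern did, the two can be compared on sample input. This is only a sketch: SAMPLE_LINES and extract_urls are made-up names, and the sample lines are invented stand-ins for the contents of channels.xml.

```ruby
# Invented sample lines standing in for channels.xml.
SAMPLE_LINES = [
  '<ref href="mms://example.com/live/channel-1.asf"/>',
  '<title>No stream here</title>'
]

# The pattern from the question and the cleaned-up pattern from this answer.
ORIGINAL   = Regexp.new('(mms:\/\/{1}[a-zA-Z\.\d\/\w-]+)')
SIMPLIFIED = %r|(mms://{1}[\w\./-]+)|

# Hypothetical helper: returns the matched text of every line that matches.
def extract_urls(lines, pattern)
  # String#[] with a Regexp returns the matched text, or nil if no match.
  lines.map { |line| line[pattern] }.compact
end

puts extract_urls(SAMPLE_LINES, ORIGINAL)
puts extract_urls(SAMPLE_LINES, SIMPLIFIED)
```

Both calls should print the same URL, which suggests the shorter character class is a safe replacement for this kind of input.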

One very easy to read approach is just to store the result of the match, then only print if there's a match:
File.open("channels.xml").each do |line|
m = line.match('(mms:\/\/{1}[a-zA-Z\.\d\/\w-]+)')
puts m if m
end
If you want to start getting clever (and have less-readable code), use $&, the special global variable that holds the text of the last successful match:
File.open("channels.xml").each do |line|
puts $& if line.match('(mms:\/\/{1}[a-zA-Z\.\d\/\w-]+)')
end

Personally, I would probably just use the POSIX grep command. But there is Enumerable#grep in Ruby, too:
puts File.readlines('channels.xml').grep(%r|mms://{1}[\w\./-]+|)
Alternatively, you could use some of Ruby's file and line processing magic that it inherited from Perl. If you pass the -p flag to the Ruby interpreter, it will assume that the script you pass in is wrapped with while gets; ...; end and at the end of each loop it will print the current line. You can then use the $_ special variable to access the current line and use the next keyword to skip iteration of the loop if you don't want the line printed:
ruby -pe 'next unless $_ =~ %r|mms://{1}[\w\./-]+|' channels.xml
Basically,
ruby -pe 'next unless $_ =~ /re/' file
is equivalent to
grep -E re file
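One difference from the original script is worth noting: like the POSIX command, Enumerable#grep returns the whole matching line, not just the URL. Passing a block to grep lets you transform each matching line. A minimal sketch, where MMS_URL, matching_urls and the sample lines are made-up names and data:

```ruby
MMS_URL = %r|mms://[\w./-]+|

# Hypothetical helper: selects matching lines, then keeps only the matched text.
def matching_urls(lines)
  lines.grep(MMS_URL) { |line| line[MMS_URL] }
end

# Invented sample input; with a real file, File.foreach('channels.xml')
# could be passed instead of the array, since it returns an Enumerator.
sample = ['<ref href="mms://example.com/a.asf"/>', '<title>none</title>']
puts matching_urls(sample)
```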

Related

Regular expression - Ruby vs Perl

I noticed some extreme delays in my Ruby (1.9) scripts and after some digging it boiled down to regular expression matching. I'm using the following test scripts in Perl and in Ruby:
Perl:
$fname = shift(@ARGV);
open(FILE, "<$fname" );
while (<FILE>) {
if ( /(.*?) \|.*?SENDING REQUEST.*?TID=(.*?),/ ) {
print "$1: $2\n";
}
}
Ruby:
f = File.open( ARGV.shift )
while ( line = f.gets )
if /(.*?) \|.*?SENDING REQUEST.*?TID=(.*?),/.match(line)
puts "#{$1}: #{$2}"
end
end
I use the same input for both scripts, a file with only 44290 lines.
The timing for each one is:
Perl:
xenofon@cpm:~/bin/local/project$ time ./try.pl input >/dev/null
real 0m0.049s
user 0m0.040s
sys 0m0.000s
Ruby:
xenofon@cpm:~/bin/local/project$ time ./try.rb input >/dev/null
real 1m5.106s
user 1m4.910s
sys 0m0.010s
I guess I'm doing something awfully stupid, any suggestions?
Thank you
regex = Regexp.new(/(.*?) \|.*?SENDING REQUEST.*?TID=(.*?),/)
f = File.open( ARGV.shift ).each do |line|
if regex.match(line)
puts "#{$1}: #{$2}"
end
end
Or
regex = Regexp.new(/(.*?) \|.*?SENDING REQUEST.*?TID=(.*?),/)
f = File.open( ARGV.shift )
f.each_line do |line|
if regex.match(line)
puts "#{$1}: #{$2}"
end
end
One possible difference is the amount of backtracking being performed. Perl might do a better job of pruning the search tree when backtracking (i.e. noticing when part of a pattern can't possibly match). Its regex engine is highly optimised.
First, adding a leading «^» could make a huge difference. If the pattern doesn't match starting at position 0, it's not going to match at starting position 1 either! So don't try to match at position 1.
Along the same lines, «.*?» isn't as limiting as you might think, and replacing each instance of it with a more limiting pattern could prevent a lot of backtracking.
Why don't you try:
/
^
(.*?) [ ]\|
(?:(?!SENDING[ ]REQUEST).)* SENDING[ ]REQUEST
(?:(?!TID=).)* TID=
([^,]*) ,
/x
(Not sure if it was safe to replace the first «.*?» with «[^|]*», so I didn't.)
(At least for patterns that match a single string, (?:(?!PAT).) is to PAT as [^CHAR] is to CHAR.)
Using /s could possibly speed things up if «.» is allowed to match newlines, but I think it's pretty minor.
Using «\space» instead of «[space]» to match a space under /x might be slightly faster in Ruby. (They're the same in recent versions of Perl.) I used the latter because it's far more readable.
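As a quick sanity check that the tighter pattern captures the same fields as the original, both can be run against a sample line. This is only a sketch: the log lines are invented (the real input format is not shown in the question), and captures_for is a made-up helper. It demonstrates equivalence on these samples, not speed.

```ruby
ORIGINAL_RE = /(.*?) \|.*?SENDING REQUEST.*?TID=(.*?),/

# The anchored pattern from this answer, written with /x for readability.
TIGHTER_RE = /
  ^
  (.*?) [ ]\|
  (?:(?!SENDING[ ]REQUEST).)* SENDING[ ]REQUEST
  (?:(?!TID=).)* TID=
  ([^,]*) ,
/x

# Hypothetical helper: returns the captured groups, or nil when there is no match.
def captures_for(pattern, line)
  m = pattern.match(line)
  m && m.captures
end

# Invented sample lines.
HIT_LINE  = '12:00:01 | SENDING REQUEST to host TID=abc123, done'
MISS_LINE = '12:00:02 | RECEIVED RESPONSE TID=zzz999, done'

p captures_for(ORIGINAL_RE, HIT_LINE)
p captures_for(TIGHTER_RE, HIT_LINE)
p captures_for(TIGHTER_RE, MISS_LINE)
```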
From perlretut, chapter "Using regular expressions in Perl", section "Search and replace":
(Even though the regular expression appears in a loop, Perl is smart enough to compile it only once.)
I don't know Ruby very well, but I suspect that it does compile the regex on each iteration.
(Try the code from LaGrandMere's answer to verify it.)
Try using the (?>re) extension. See the Ruby documentation for details; here is a quote:
This construct [..] inhibits backtracking, which can be a
performance enhancement. For example, the pattern /a.*b.*a/ takes
exponential time when matched against a string containing an a
followed by a number of bs, but with no trailing a. However,
this can be avoided by using a nested regular expression
/a(?>.*b).*a/.
File.open(ARGV.shift) do |f|
while line = f.gets
if /(.*?)(?> \|.*?SENDING REQUEST.*?TID=)(.*?),/.match(line)
puts "#{$1}: #{$2}"
end
end
end
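What "inhibits backtracking" means can be seen on tiny inputs: once an atomic group has matched, the engine will not go back into it to give characters up. The sketch below only illustrates that semantic difference (it makes no timing claims); the constant names and the matches_pattern? helper are made up for the example.

```ruby
GREEDY = /a.*b/
ATOMIC = /a(?>.*)b/        # the group swallows the rest of the string for good
DOCS_EXAMPLE = /a(?>.*b).*a/

# Hypothetical helper: true when the pattern matches anywhere in the string.
def matches_pattern?(pattern, string)
  !pattern.match(string).nil?
end

# .* first grabs "ab", then backtracks one character so that b can match.
puts matches_pattern?(GREEDY, 'aab')

# (?>.*) grabs "ab" and refuses to give the b back, so the match fails.
puts matches_pattern?(ATOMIC, 'aab')

# Backtracking inside the group still happens until the group itself succeeds.
puts matches_pattern?(DOCS_EXAMPLE, 'abba')
```

Because an atomic group can change what a pattern matches, it is worth checking the rewritten pattern against lines that are known to match before relying on it.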
Ruby:
File.open(ARGV.shift).each do |line|
if line =~ /(.*?) \|.*?SENDING REQUEST.*?TID=(.*?),/
puts "#{$1}: #{$2}"
end
end
Change the match method to the =~ operator. It is faster, as the following benchmark shows:
(Ruby has Benchmark. I don't know your file content so I randomly typed something)
require 'benchmark'
def bm(n)
Benchmark.bm do |x|
x.report{n.times{"asdfajdfaklsdjfklajdklfj".match(/fa/)}}
x.report{n.times{"asdfajdfaklsdjfklajdklfj" =~ /fa/}}
x.report{n.times{/fa/.match("asdfajdfaklsdjfklajdklfj")}}
end
end
bm(100000)
Output report:
user system total real
0.141000 0.000000 0.141000 ( 0.140564)
0.047000 0.000000 0.047000 ( 0.046855)
0.125000 0.000000 0.125000 ( 0.124945)
The middle one uses =~ and takes roughly a third of the time of the other two, which use the match method. So, use =~ in your code.
Regular expression matching is time-consuming compared to other forms of matching. Since you are expecting a long, static string in the middle of your matching lines, try filtering out lines that don't include that string by using relatively-cheap string operations. That should result in less that needs to go through regular expression parsing (depending on what your input looks like, of course).
f = File.open( ARGV.shift )
my_re = Regexp.new(/(.*?) \|.*?SENDING REQUEST.*?TID=(.*?),/)
while ( line = f.gets )
next unless line.include?('SENDING REQUEST') # cheap substring test; Ruby has no continue keyword
if my_re.match(line)
puts "#{$1}: #{$2}"
end
end
f.close()
I haven't benchmarked this particular version since I don't have your input data. I have had success doing things like this in the past, though, especially with lengthy logfiles where pre-filtering can eliminate the vast majority of the input without running any regular expressions.
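The same pre-filtering idea can be written as a small method that works on any enumerable of lines, which makes it easy to try on sample data. A sketch under assumptions: REQUEST_RE reuses the pattern from the question, while extract_requests and the sample lines are invented for illustration.

```ruby
REQUEST_RE = /(.*?) \|.*?SENDING REQUEST.*?TID=(.*?),/

# Hypothetical helper: returns "prefix: tid" strings for matching lines.
def extract_requests(lines)
  lines.each_with_object([]) do |line, found|
    # Cheap substring test first; most lines never reach the regexp.
    next unless line.include?('SENDING REQUEST')
    if (m = REQUEST_RE.match(line))
      found << "#{m[1]}: #{m[2]}"
    end
  end
end

# Invented sample lines.
sample = [
  '12:00:01 | SENDING REQUEST to host TID=abc123, done',
  '12:00:02 | heartbeat',
  '12:00:03 | SENDING REQUEST without an id'
]
puts extract_requests(sample)
```

With a real log, File.foreach(path) could be passed in place of the array so the file is read line by line.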

Search and replace multiple words in file via Ruby

Good afternoon!
I am pretty new to Ruby and want to code a basic search and replace function in Ruby.
When you call the function, you can pass parameters (search pattern, replacing word).
This works like this: multiedit(pattern1, replacement1, pattern2, replacement2, ...)
Now, I want my function to read a text file, search for pattern1 and replace it with replacement1, search for pattern2 and replace it with replacement2 and so on. Finally, the altered text should be written to another text file.
I've tried to do this with an until loop, but all I get is that only the very first pattern is replaced while all the following patterns are ignored (in this example, only apple is replaced with fruit). I think the problem is that I always reread the original unaltered text? But I can't figure out a solution. Can you help me? Calling the function the way I am doing it is important for me.
def multiedit(*_patterns)
return puts "Number of search patterns does not match number of replacement strings!" if (_patterns.length % 2 > 0)
f = File.open("1.txt", "r")
g = File.open("2.txt", "w")
i = 0
until i >= _patterns.length do
f.each_line {|line|
output = line.sub(_patterns[i], _patterns[i+1])
g.puts output
}
i+=2
end
f.close
g.close
end
multiedit("apple", "fruit", "tomato", "veggie", "steak", "meat")
Can you help me out?
Thank you very much in advance!
Regards
Your loop was kind of inside-out ... do this instead ...
f.each_line do |line|
_patterns.each_slice 2 do |a, b|
line.sub! a, b
end
g.puts line
end
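To see the pairing done by each_slice without touching any files, the same loop can be run over a string. This is a sketch, not the asker's function: multiedit_text is a made-up name and takes the text as its first argument, followed by the usual search/replacement pairs.

```ruby
# Hypothetical string-based variant of multiedit.
def multiedit_text(text, *patterns)
  if patterns.length.odd?
    raise ArgumentError, 'Number of search patterns does not match number of replacement strings!'
  end

  text.each_line.map do |line|
    # each_slice(2) yields ["apple", "fruit"], ["tomato", "veggie"], ...
    patterns.each_slice(2) do |search, replacement|
      line = line.sub(search, replacement)
    end
    line
  end.join
end

puts multiedit_text("apple and tomato\nsteak\n",
                    'apple', 'fruit', 'tomato', 'veggie', 'steak', 'meat')
```

Every pair is applied to each line before the line is emitted, which is exactly what the original until loop failed to do.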
Perhaps the most efficient way to evaluate all the patterns for every line is to build a single regexp from all the search patterns and use the hash replacement form of String#gsub:
def multiedit *patterns
raise ArgumentError, "Number of search patterns does not match number of replacement strings!" if (patterns.length % 2 != 0)
replacements = Hash[ *patterns ]
regexp = Regexp.new replacements.keys.map {|k| Regexp.quote(k) }.join('|')
File.open("2.txt", "w") do |out|
IO.foreach("1.txt") do |line|
out.puts line.gsub regexp, replacements
end
end
end
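One consequence of the single-regexp approach is that the text is scanned once, so a replacement is never itself replaced by a later pattern. A small sketch of that behavior, using Regexp.union (which escapes plain strings) in place of the manual join; replace_all is a made-up helper name.

```ruby
# Hypothetical helper: applies all replacements in a single pass.
def replace_all(text, replacements)
  pattern = Regexp.union(replacements.keys)
  # With a Hash argument, gsub looks each matched text up as a key.
  text.gsub(pattern, replacements)
end

# Swapping two words works, because the output is not rescanned.
puts replace_all('cat chases dog', 'cat' => 'dog', 'dog' => 'cat')
```

Applying the two substitutions one after the other would first turn every cat into dog and then turn all of them back into cat.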
An easier and better method is to use ERB:
http://apidock.com/ruby/ERB

Ruby: Use condition result in condition block

I have code like this:
reg = /(.+)_path/
if reg.match('home_path')
puts reg.match('home_path')[0]
end
This evaluates the regex twice :(
So...
reg = /(.+)_path/
result = reg.match('home_path')
if result
puts result[0]
end
But this keeps the variable result around in memory afterwards.
I have one functional-programming idea:
/(.+)_path/.match('home_path').compact.each do |match|
puts match[0]
end
But it seems there should be a better solution, shouldn't there?
There are special global variables (their names start with $) that contain results of the last regexp match:
r = /(.+)_path/
# $1, $2, ... $n - the n-th group of the last successful match
puts $1 if r.match('home_path')
# => home
# $& - the string matched by the last successful match
puts $& if r.match('home_path')
# => home_path
You can find the full list of predefined global variables here.
Note that in the examples above puts won't be executed at all if you pass a string that doesn't match the regexp.
In the general case, you can always put the assignment into the condition itself:
if m = /(.+)_path/.match('home_path')
puts m[0]
end
Though, many people don't like that as it makes code less readable and gives a good opportunity for confusing = and ==.
My personal favorite (w/ 1.9+) is some variation of:
if /(?<prefix>.+)_path/ =~ "home_path"
puts prefix
end
If you really want a one-liner: puts /(?<prefix>.+)_path/ =~ 'home_path' ? prefix : false
See the Ruby Docs for a few limitations of named captures and #=~.
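The main limitation is that the local variable is only created when a regexp literal sits on the left-hand side of =~. A short sketch of both orderings; path_prefix and path_prefix_reversed are made-up helper names.

```ruby
# Regexp literal on the left: the named group becomes a local variable.
def path_prefix(name)
  if /(?<prefix>.+)_path/ =~ name
    prefix
  end
end

# Regexp on the right: no local variable is created, but the capture
# is still available by name through the MatchData stored in $~.
def path_prefix_reversed(name)
  name =~ /(?<prefix>.+)_path/ ? $~[:prefix] : nil
end

p path_prefix('home_path')
p path_prefix('homepath')
p path_prefix_reversed('home_path')
```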
From the docs: If a block is given, invoke the block with MatchData if match succeed.
So:
/(.+)_path/.match('home_path') { |m| puts m[1] } # => home
/(.+)_path/.match('homepath') { |m| puts m[1] } # prints nothing
How about...
if m=/regex here/.match(string) then puts m[0] end
A neat one-line solution, I guess :)
How about this?
puts $~ if /regex/.match("string")
$~ is a special variable that stores the last regexp match. More info: http://www.regular-expressions.info/ruby.html
Actually, this can be done with no conditionals at all. (The expression evaluates to "" if there is no match.)
puts /(.+)_path/.match('home_xath').to_a[0].to_s

Can using the ruby flip-flop as a filter be made less kludgy?

In order to get part of text, I'm using a true if kludge in front of a flip-flop:
desired_portion_lines = text.each_line.find_all do |line|
true if line =~ /start_regex/ .. line =~ /finish_regex/
end
desired_portion = desired_portion_lines.join
If I remove the true if bit, it complains
bad value for range (ArgumentError)
Is it possible to make it less kludgy, or should I merely do
desired_portion_lines = ""
text.each_line do |line|
desired_portion_lines << line if line =~ /start_regex/ .. line =~ /finish_regex/
end
Or is there a better approach that doesn't use enumeration?
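If you are doing it line by line, my preference is something like this (the flag has to be true/false, because 0 is truthy in Ruby):
printing = false # set once, before the loop
# then, for each line:
printing = false if line =~ /finish_regex/
printing = true if line =~ /start_regex/
puts line if printing
If you have it all in one string, I would use split: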
mystring.split(/finish_regex/).each do |item|
if item[/start_regex/]
puts item.split(/start_regex/)[-1]
end
end
I think
desired_portion_lines = ""
text.each_line do |line|
desired_portion_lines << line if line =~ /start_regex/ .. line =~ /finish_regex/
end
is perfectly acceptable. The .. operator is very powerful, but not used by a lot of people, probably because they don't understand what it does. Possibly it looks weird or awkward to you because you're not used to using it, but it'll grow on you. It's very common in Perl when dealing with ranges of lines in text files, which is where I first encountered it, and eventually was using it a lot.
The only thing I'd do differently is add some parentheses to visually separate the logical tests from each other, and from the rest of the line:
desired_portion_lines = ""
text.each_line do |line|
desired_portion_lines << line if ( (line =~ /start_regex/) .. (line =~ /finish_regex/) )
end
Ruby (and Perl) coders seem to abhor using parentheses, but I consider them useful for visually separating the logic tests. For me it's a readability and, by extension, a maintenance thing.
The only other thing I can think of that might help, would be to change desired_portion_lines to an array, and push your selected lines onto it. Currently, using desired_portion_lines << line appends to the string, mutating it each time. It might be faster pushing on the array then joining its elements afterward to build your string.
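Back to the first example: the flip-flop is only recognized inside a conditional, which is why removing true if raises the ArgumentError, but the statement can still be collapsed into a single expression:
desired_portion = text.each_line.find_all { |line| true if (line =~ /start_regex/) .. (line =~ /finish_regex/) }.join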
The only downside to iterating over all lines in a file using the flip-flop, is that if the start-pattern can occur multiple times, you'll get each found block added to desired_portion.
You can save three characters by replacing true if with !!() (with the flip flop belonging in between the parentheses).
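For anyone who wants to try the flip-flop on sample data, here is a small self-contained sketch. lines_between is a made-up helper name, and the sample text and the START/END markers are invented.

```ruby
# Hypothetical helper: returns the lines from a start match through a finish match.
def lines_between(text, start_re, finish_re)
  text.each_line.select do |line|
    # The range only acts as a flip-flop because it sits in a conditional.
    true if (line =~ start_re)..(line =~ finish_re)
  end
end

sample = "intro\nSTART\nbody\nEND\noutro\n"
puts lines_between(sample, /START/, /END/).join
```

Both boundary lines are included in the result; if the start pattern occurs again later in the text, the following block is collected as well.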

Create regular expression from string

Is there any way to create the regex /func:\[sync\] displayPTS/ from string func:[sync] displayPTS?
The story behind this question is that I have several string patterns to search for in a text file and I don't want to write the same thing again and again.
File.open($f).readlines.reject {|l| not l =~ /"#{string1}"/}
File.open($f).readlines.reject {|l| not l =~ /"#{string2}"/}
Instead, I want to have a function to do the job:
def filter string
#build the reg pattern from string
File.open($f).readlines.reject {|l| not l =~ pattern}
end
filter string1
filter string2
s = "func:[sync] displayPTS"
# => "func:[sync] displayPTS"
r = Regexp.new(s)
# => /func:[sync] displayPTS/
r = Regexp.new(Regexp.escape(s))
# => /func:\[sync\]\ displayPTS/
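Putting this together, the filter function from the question could look roughly like the sketch below. It takes the lines as an argument instead of reading the global $f, purely to keep the example self-contained; filter_lines and the sample lines are made up.

```ruby
# Hypothetical helper: keeps the lines that contain the string literally.
def filter_lines(lines, string)
  pattern = Regexp.new(Regexp.escape(string))
  lines.grep(pattern)
end

# Invented sample lines.
sample = [
  'func:[sync] displayPTS=1',
  'func:s displayPTS=2'
]
needle = 'func:[sync] displayPTS'

p filter_lines(sample, needle)

# Without escaping, [sync] is a character class matching one of s, y, n, c,
# so the pattern matches the second line instead of the first.
p sample.grep(Regexp.new(needle))
```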
I like Bob's answer, but just to save the time on your keyboard:
string = 'func:\[sync] displayPTS'
/#{string}/
If the strings are just strings, you can combine them into one regular expression, like so:
targets = [
"string1",
"string2",
].collect do |s|
Regexp.escape(s)
end.join('|')
target = Regexp.new(targets)
And then:
lines = File.readlines('/tmp/bar').reject do |line|
line !~ target
end
s !~ regexp is equivalent to not s =~ regexp, but easier to read.
Avoid using File.open without closing the file. The file will remain open until the discarded file object is garbage collected, which could be long enough that your program will run out of file handles. If you need to do more than just read the lines, then:
File.open(path) do |file|
# do stuff with file
end
Ruby will close the file at the end of the block.
You might also consider whether using find_all and a positive match would be easier to read than reject and a negative match. The fewer negatives the reader's mind has to go through, the clearer the code:
lines = File.readlines('/tmp/bar').find_all do |line|
line =~ target
end
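How about using %r{}? Note that the interpolated string is treated as a pattern, so [sync] acts as a character class here unless the string is passed through Regexp.escape first: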
my_regex = "func:[sync] displayPTS"
File.open($f).readlines.reject { |l| not l =~ %r{#{my_regex}} }
