Regular expression - Ruby vs Perl - ruby

I noticed some extreme delays in my Ruby (1.9) scripts and after some digging it boiled down to regular expression matching. I'm using the following test scripts in Perl and in Ruby:
Perl:
$fname = shift(@ARGV);
open(FILE, "<$fname" );
while (<FILE>) {
if ( /(.*?) \|.*?SENDING REQUEST.*?TID=(.*?),/ ) {
print "$1: $2\n";
}
}
Ruby:
f = File.open( ARGV.shift )
while ( line = f.gets )
if /(.*?) \|.*?SENDING REQUEST.*?TID=(.*?),/.match(line)
puts "#{$1}: #{$2}"
end
end
I use the same input for both scripts, a file with only 44290 lines.
The timing for each one is:
Perl:
xenofon@cpm:~/bin/local/project$ time ./try.pl input >/dev/null
real 0m0.049s
user 0m0.040s
sys 0m0.000s
Ruby:
xenofon@cpm:~/bin/local/project$ time ./try.rb input >/dev/null
real 1m5.106s
user 1m4.910s
sys 0m0.010s
I guess I'm doing something awfully stupid, any suggestions?
Thank you

# Build the regex once, outside the loop.
regex = Regexp.new(/(.*?) \|.*?SENDING REQUEST.*?TID=(.*?),/)
File.open( ARGV.shift ).each do |line|
  if regex.match(line)
    puts "#{$1}: #{$2}"
  end
end
Or
regex = Regexp.new(/(.*?) \|.*?SENDING REQUEST.*?TID=(.*?),/)
f = File.open( ARGV.shift )
f.each_line do |line|
  if regex.match(line)
    puts "#{$1}: #{$2}"
  end
end

One possible difference is the amount of backtracking being performed. Perl might do a better job of pruning the search tree when backtracking (i.e. noticing when part of a pattern can't possibly match). Its regex engine is highly optimised.
First, adding a leading «^» could make a huge difference. If the pattern doesn't match starting at position 0, it's not going to match at starting position 1 either! So don't try to match at position 1.
Along the same lines, «.*?» isn't as limiting as you might think, and replacing each instance of it with a more limiting pattern could prevent a lot of backtracking.
Why don't you try:
/
^
(.*?) [ ]\|
(?:(?!SENDING[ ]REQUEST).)* SENDING[ ]REQUEST
(?:(?!TID=).)* TID=
([^,]*) ,
/x
(Not sure if it was safe to replace the first «.*?» with «[^|]», so I didn't.)
(At least for patterns that match a single string, (?:(?!PAT).) is to PAT as [^CHAR] is to CHAR.)
Using /s could possibly speed things up if «.» is allowed to match newlines, but I think it's pretty minor.
Using «\space» instead of «[space]» to match a space under /x might be slightly faster in Ruby. (They're the same in recent versions of Perl.) I used the latter because it's far more readable.
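As a minimal sketch of how the anchored pattern above behaves in Ruby: the sample log lines and the method name extract_tid are made up for illustration, since the asker's real input is not shown. It uses \A rather than ^ because in Ruby ^ matches at the start of every line, not only at position 0.

```ruby
# Hypothetical helper: returns [prefix, tid] for a matching line, or nil.
def extract_tid(line)
  # A regex literal without interpolation is built once, not on every call.
  pattern = /
    \A                                             # only try position 0
    (.*?) [ ]\|                                    # text before the first space-pipe
    (?:(?!SENDING[ ]REQUEST).)* SENDING[ ]REQUEST
    (?:(?!TID=).)* TID=
    ([^,]*) ,                                      # the TID runs up to the next comma
  /x
  m = pattern.match(line)
  m && [m[1], m[2]]
end

# Made-up sample lines, not taken from the asker's file.
p extract_tid("10:00:01 | worker-3 SENDING REQUEST to host TID=abc123, size=10")
p extract_tid("10:00:02 | worker-3 idle")
```

The second call returns nil after a single attempt at position 0, which is the saving the anchor is meant to buy on non-matching lines.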

From the perlretut chapter: Using regular expressions in Perl section - "Search and replace"
(Even though the regular expression appears in a loop, Perl is smart enough to compile it only once.)
I don't know Ruby very well, but I suspect that it compiles the regex on each iteration.
(Try the code from LaGrandMere's answer to verify it.)

Try using the (?>re) extension. See the Ruby documentation for details; here is a quote:
This construct [..] inhibits backtracking, which can be a
performance enhancement. For example, the pattern /a.*b.*a/ takes
exponential time when matched against a string containing an a
followed by a number of bs, but with no trailing a. However,
this can be avoided by using a nested regular expression
/a(?>.*b).*a/.
File.open(ARGV.shift) do |f|
while line = f.gets
if /(.*?)(?> \|.*?SENDING REQUEST.*?TID=)(.*?),/.match(line)
puts "#{$1}: #{$2}"
end
end
end

Ruby:
File.open(ARGV.shift).each do |line|
if line =~ /(.*?) \|.*?SENDING REQUEST.*?TID=(.*?),/
puts "#{$1}: #{$2}"
end
end
Change the match method to the =~ operator. It is faster, as this benchmark shows:
(Ruby has Benchmark. I don't know your file content so I randomly typed something)
require 'benchmark'
def bm(n)
Benchmark.bm do |x|
x.report{n.times{"asdfajdfaklsdjfklajdklfj".match(/fa/)}}
x.report{n.times{"asdfajdfaklsdjfklajdklfj" =~ /fa/}}
x.report{n.times{/fa/.match("asdfajdfaklsdjfklajdklfj")}}
end
end
bm(100000)
Output report:
user system total real
0.141000 0.000000 0.141000 ( 0.140564)
0.047000 0.000000 0.047000 ( 0.046855)
0.125000 0.000000 0.125000 ( 0.124945)
The middle one uses =~ and takes roughly a third of the time of the other two, which use the match method. So, use =~ in your code.

Regular expression matching is time-consuming compared to other forms of matching. Since you are expecting a long, static string in the middle of your matching lines, try filtering out lines that don't include that string by using relatively cheap string operations. That way fewer lines reach the regular expression at all (depending on what your input looks like, of course).
f = File.open( ARGV.shift )
my_re = Regexp.new(/(.*?) \|.*?SENDING REQUEST.*?TID=(.*?),/)
while ( line = f.gets )
  # Ruby spells "skip to the next iteration" as next, not continue.
  next if line.index('SENDING REQUEST') == nil
  if my_re.match(line)
    puts "#{$1}: #{$2}"
  end
end
f.close()
I haven't benchmarked this particular version since I don't have your input data. I have had success doing things like this in the past, though, especially with lengthy logfiles where pre-filtering can eliminate the vast majority of the input without running any regular expressions.
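A small self-contained sketch of the same pre-filtering idea, using String#include? for the cheap test. The method name sending_requests and the sample lines are invented for illustration, and no timing is claimed for it.

```ruby
# Hypothetical helper: cheap substring test first, regex only on candidate lines.
def sending_requests(lines)
  lines.each_with_object([]) do |line, out|
    next unless line.include?('SENDING REQUEST')  # plain string search, no regex
    if line =~ /(.*?) \|.*?SENDING REQUEST.*?TID=(.*?),/
      out << "#{$1}: #{$2}"
    end
  end
end

# Made-up sample input.
lines = [
  "10:00:01 | worker SENDING REQUEST TID=42, ok\n",
  "10:00:02 | worker idle\n",
]
puts sending_requests(lines)
```

How much this helps depends on what fraction of the input lacks the literal string.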

Related

See if the beginning of a line matches a regex character

There are lines inside a file that contain !. I need all other lines. I only want to print lines within the file that do not start with an exclamation mark.
The line of code which I have written so far is:
unless parts.each_line.split("\n" =~ /^!/)
# other bit of nested code
end
But it doesn't work. How do I do it?
As a start I'd use:
File.foreach('foo.txt') do |li|
next if li[0] == '!'
puts li
end
foreach is extremely fast and allows your code to handle any size file - "scalable" is the term. See "Why is "slurping" a file not a good practice?" for more information.
li[0] is a common idiom in Ruby to get the first character of a string. Again, it's very fast and is my favorite way to get there, however consider these tests:
require 'fruity'
STR = '!' + ('a'..'z').to_a.join # => "!abcdefghijklmnopqrstuvwxyz"
compare do
_slice { STR[0] == '!' }
_start_with { STR.start_with?('!') }
_regex { !!STR[/^!/] }
end
# >> Running each test 32768 times. Test will take about 2 seconds.
# >> _start_with is faster than _slice by 2x ± 1.0
# >> _slice is similar to _regex
Using start_with? (or end_with?, its equivalent for the end of a String) is twice as fast, and it looks like I'll be using start_with? and end_with? from now on.
Combine that with foreach and your code will have a decent chance of being fast and efficient.
See "What is the fastest way to compare the start or end of a String with a sub-string using Ruby?" for more information.
You can use String#start_with? to find the lines that start with a particular string.
file = File.open('file.txt').read
file.each_line do |line|
unless line.start_with?('!')
print line
end
end
You can also check the index of the first character
unless line[0] === "!"
You can also do this with a regex
# scan returns an Array; test for emptiness, since an Integer length is always truthy in Ruby
unless line.scan(/^!/).any?
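Putting the pieces from these answers together, a minimal sketch might look like the following. The helper name lines_without_bang is made up, and a StringIO stands in for the real file so the example runs on its own.

```ruby
require 'stringio'

# Hypothetical helper: returns the lines of io that do not start with '!'.
def lines_without_bang(io)
  io.each_line.reject { |line| line.start_with?('!') }
end

# StringIO stands in for a real file; File.foreach(path) yields lines the same way.
sample = StringIO.new("keep me\n!skip me\nalso ! kept\n")
print lines_without_bang(sample).join
```

Only a '!' in the first column causes a line to be dropped; a '!' elsewhere in the line is left alone.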

Optimising ruby regexp -- lots of match groups

I'm working on a Ruby-based lexer. To improve performance, I joined all the tokens' regexps into one big regexp with named match groups. The resulting regexp looks like:
/\A(?<__anonymous_-1038694222803470993>(?-mix:\n+))|\A(?<__anonymous_-1394418499721420065>(?-mix:\/\/[\A\n]*))|\A(?<__anonymous_3077187815313752157>(?-mix:include\s+"[\A"]+"))|\A(?<LET>(?-mix:let\s))|\A(?<IN>(?-mix:in\s))|\A(?<CLASS>(?-mix:class\s))|\A(?<DEF>(?-mix:def\s))|\A(?<DEFM>(?-mix:defm\s))|\A(?<MULTICLASS>(?-mix:multiclass\s))|\A(?<FUNCNAME>(?-mix:![a-zA-Z_][a-zA-Z0-9_]*))|\A(?<ID>(?-mix:[a-zA-Z_][a-zA-Z0-9_]*))|\A(?<STRING>(?-mix:"[\A"]*"))|\A(?<NUMBER>(?-mix:[0-9]+))/
I'm matching it to my string producing a MatchData where exactly one token is parsed:
bigregex =~ "\n ... garbage"
puts $~.inspect
Which outputs
#<MatchData
"\n"
__anonymous_-1038694222803470993:"\n"
__anonymous_-1394418499721420065:nil
__anonymous_3077187815313752157:nil
LET:nil
IN:nil
CLASS:nil
DEF:nil
DEFM:nil
MULTICLASS:nil
FUNCNAME:nil
ID:nil
STRING:nil
NUMBER:nil>
So, the regex actually matched the "\n" part. Now, I need to figure out the match group it belongs to (it's clearly visible from the #inspect output that it's __anonymous_-1038694222803470993, but I need to get it programmatically).
I could not find any option other than iterating over #names:
m.names.each do |n|
if m[n]
type = n.to_sym
resolved_type = (n.start_with?('__anonymous_') ? nil : type)
val = m[n]
break
end
end
which verifies that the match group did have a match.
The problem here is that it's slow: I spend about 10% of the time in the loop, and another 8% grabbing @input[@pos..-1] to make sure that \A works as expected to match the start of the string (I do not discard input, I just shift @pos in it).
You can check the full code at GH repo.
Any ideas on how to make it at least a bit faster? Is there any option to figure the "successful" match group easier?
You can do this using the MatchData methods .captures() and .names():
matching_string = "\n ...garbage" # or whatever this really is in your code
@input = matching_string.match bigregex # bigregex = your regex
arr = @input.captures
arr.each_with_index do |value, index|
  if not value.nil?
    the_name_you_want = @input.names[index]
  end
end
Or if you expect multiple successful values, you could do:
success_names_arr = []
success_names_arr.push(@input.names[index]) # within the above loop
This is pretty similar to your original idea, but if you're looking for efficiency, the .captures() method should help with that.
I may have misunderstood this completely, but I'm assuming that all but one token is nil and the non-nil one is the one you're after?
If so then, depending on the flavour of regex you're using, you could use a negative lookahead to check for a non-nil value
([^\n:]+:(?!nil)[^\n\>]+)
This will match the whole token ie NAME:value.
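For reference, the name lookup discussed above can be written compactly with Enumerable#find. This is only a sketch on a much smaller, made-up token regex (first_token and the three token names are illustrative), and it still walks the group names, so it is not claimed to be faster than the original loop.

```ruby
# Hypothetical, much smaller token regex in the same shape as the one in the question.
def first_token(input)
  token_re = /\A(?<NUMBER>[0-9]+)|\A(?<ID>[a-zA-Z_][a-zA-Z0-9_]*)|\A(?<NEWLINE>\n+)/
  m = token_re.match(input)
  return nil unless m
  # Exactly one alternative matched, so exactly one named group is non-nil.
  name = m.names.find { |n| m[n] }
  [name.to_sym, m[name]]
end

p first_token("foo 42")
p first_token("42 foo")
p first_token(" foo")
```

The last call returns nil because no alternative matches at the start of the string.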

What does the o modifier for a regexp mean?

Ruby regexp has some options (e.g. i, x, m, o). i means ignore case, for instance.
What does the o option mean? In ri Regexp, it says o means to perform #{} interpolation only once. But when I do this:
a = 'one'
b = /#{a}/
a = 'two'
b does not change (it stays /one/). What am I missing?
Straight from the go-to source for regular expressions:
/o causes any #{...} substitutions in a particular regex literal to be performed just once, the first time it is evaluated. Otherwise, the substitutions will be performed every time the literal generates a Regexp object.
I could also turn up this usage example:
# avoid interpolating patterns like this if the pattern
# isn't going to change:
pattern = ARGV.shift
ARGF.each do |line|
print line if line =~ /#{pattern}/
end
# the above creates a new regex each iteration. Instead,
# use the /o modifier so the regex is compiled only once
pattern = ARGV.shift
ARGF.each do |line|
print line if line =~ /#{pattern}/o
end
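So /o only matters for a regex literal that is evaluated more than once, such as one inside a loop. In your example the literal is evaluated a single time, so b keeps the value it was built with, with or without /o.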
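A minimal sketch of that behaviour, assuming a made-up helper matches_once? whose regex literal is evaluated on every call:

```ruby
# Hypothetical helper to show that /o keeps the first interpolated value.
def matches_once?(pattern, text)
  # With /o, #{pattern} is interpolated only the first time this literal is evaluated.
  !!(text =~ /#{pattern}/o)
end

p matches_once?('one', 'one')   # the literal is built here as /one/
p matches_once?('two', 'two')   # still /one/, so 'two' does not match
p matches_once?('two', 'one')   # still /one/, so 'one' matches
```

Without the o flag, the second call would build /two/ and match, which is why /o is only safe when the interpolated value never changes.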

Ruby, remove last N characters from a string?

What is the preferred way of removing the last n characters from a string?
irb> 'now is the time'[0...-4]
=> "now is the "
If the characters you want to remove are always the same characters, then consider chomp:
'abc123'.chomp('123') # => "abc"
The advantages of chomp are: no counting, and the code more clearly communicates what it is doing.
With no arguments, chomp removes the DOS or Unix line ending, if either is present:
"abc\n".chomp # => "abc"
"abc\r\n".chomp # => "abc"
From the comments, there was a question of the speed of using #chomp versus using a range. Here is a benchmark comparing the two:
require 'benchmark'
S = 'asdfghjkl'
SL = S.length
T = 10_000
A = 1_000.times.map { |n| "#{n}#{S}" }
GC.disable
Benchmark.bmbm do |x|
x.report('chomp') { T.times { A.each { |s| s.chomp(S) } } }
x.report('range') { T.times { A.each { |s| s[0...-SL] } } }
end
Benchmark Results (using CRuby 2.1.3p242):
Rehearsal -----------------------------------------
chomp 1.540000 0.040000 1.580000 ( 1.587908)
range 1.810000 0.200000 2.010000 ( 2.011846)
-------------------------------- total: 3.590000sec
user system total real
chomp 1.550000 0.070000 1.620000 ( 1.610362)
range 1.970000 0.170000 2.140000 ( 2.146682)
So chomp is faster than using a range, by ~22%.
Ruby 2.5+
As of Ruby 2.5 you can use delete_suffix or delete_suffix! to achieve this in a fast and readable manner.
The docs on the methods are here.
If you know what the suffix is, this is idiomatic (and I'd argue, even more readable than other answers here):
'abc123'.delete_suffix('123') # => "abc"
'abc123'.delete_suffix!('123') # => "abc"
It's even significantly faster (almost 40% with the bang method) than the top answer. Here's the result of the same benchmark:
user system total real
chomp 0.949823 0.001025 0.950848 ( 0.951941)
range 1.874237 0.001472 1.875709 ( 1.876820)
delete_suffix 0.721699 0.000945 0.722644 ( 0.723410)
delete_suffix! 0.650042 0.000714 0.650756 ( 0.651332)
I hope this is useful - note the method doesn't currently accept a regex so if you don't know the suffix it's not viable for the time being. However, as the accepted answer (update: at the time of writing) dictates the same, I thought this might be useful to some people.
str = str[0..-1-n]
Unlike the [0...-n], this handles the case of n=0.
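To make the n = 0 edge case concrete, here is a small sketch. The helper name drop_last is made up, and it also clamps at zero so that n larger than the string length yields an empty string rather than nil.

```ruby
# Hypothetical helper: drop the last n characters, safe for n == 0 and n > length.
def drop_last(str, n)
  keep = [str.length - n, 0].max
  str[0, keep]
end

p 'now is the time'[0...-0]        # -0 is 0, so the range 0...0 selects nothing
p drop_last('now is the time', 4)
p drop_last('now is the time', 0)
p drop_last('abc', 10)
```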
I would suggest chop. I think it has been mentioned in one of the comments but without links or explanations so here's why I think it's better:
It simply removes the last character from a string and you don't have to specify any values for that to happen.
If you need to remove more than one character then chomp is your best bet. This is what the ruby docs have to say about chop:
Returns a new String with the last character removed. If the string
ends with \r\n, both characters are removed. Applying chop to an empty
string returns an empty string. String#chomp is often a safer
alternative, as it leaves the string unchanged if it doesn’t end in a
record separator.
Although this is used mostly to remove separators such as \r\n I've used it to remove the last character from a simple string, for example the s to make the word singular.
name = "my text"
x.times do name.chop! end
Here in the console:
>name = "Nabucodonosor"
=> "Nabucodonosor"
> 7.times do name.chop! end
=> 7
> name
=> "Nabuco"
Dropping the last n characters is the same as keeping the first length - n characters.
Active Support includes String#first and String#last methods which provide a convenient way to keep or drop the first/last n characters:
require 'active_support/core_ext/string/access'
"foobarbaz".first(3) # => "foo"
"foobarbaz".first(-3) # => "foobar"
"foobarbaz".last(3) # => "baz"
"foobarbaz".last(-3) # => "barbaz"
If you are using Rails, try:
"my_string".last(2) # => "ng"
[EDITED]
To get the string WITHOUT the last 2 chars:
n = "my_string".size
"my_string"[0..n-3] # => "my_stri"
Note: the last string char is at n-1. So, to remove the last 2, we use n-3.
Check out the slice() method:
http://ruby-doc.org/core-2.5.0/String.html#method-i-slice
You can always use something like
"string".sub!(/.{X}$/,'')
Where X is the number of characters to remove.
Or with assigning/using the result:
myvar = "string"[0..-X]
where X is one more than the number of characters to remove.
If you're OK with monkey-patching String and also want the characters you chop off, try this:
class String
def chop_multiple(amount)
amount.times.inject([self, '']){ |(s, r)| [s.chop, r.prepend(s[-1])] }
end
end
hello, world = "hello world".chop_multiple 5
hello #=> 'hello '
world #=> 'world'
Using regex:
str = 'string'
n = 2 #to remove last n characters
str[/\A.{#{str.size-n}}/] #=> "stri"
x = "my_test"
last_char = x.split('').last

Looking to clean up a small ruby script

I'm looking for a much more idiomatic way to do the following little ruby script.
File.open("channels.xml").each do |line|
if line.match('(mms:\/\/{1}[a-zA-Z\.\d\/\w-]+)')
puts line.match('(mms:\/\/{1}[a-zA-Z\.\d\/\w-]+)')
end
end
Thanks in advance for any suggestions.
The original:
File.open("channels.xml").each do |line|
if line.match('(mms:\/\/{1}[a-zA-Z\.\d\/\w-]+)')
puts line.match('(mms:\/\/{1}[a-zA-Z\.\d\/\w-]+)')
end
end
can be changed into this:
m = nil
open("channels.xml").each do |line|
puts m if m = line.match(%r|(mms://{1}[\w\./-]+)|)
end
File.open can be changed to just open.
if XYZ
puts XYZ
end
can be changed to puts x if x = XYZ as long as x has occurred at some place in the current scope before the if statement.
The Regexp '(mms:\/\/{1}[a-zA-Z\.\d\/\w-]+)' can be refactored a little bit. Using the %rXX notation, you can create regular expressions without the need for so many backslashes, where X is any matching character, such as ( and ) or in the example above, | |.
This character class [a-zA-Z\.\d\/\w-] (read: A to Z, case insensitive, the period character, 0 to 9, a forward slash, any word character, or a dash) is a little redundant. \w denotes "word characters", i.e. A-Za-z0-9 and underscore. Since you specify \w as a positive match, A-Za-z and \d are redundant.
Using those 2 cleanups, the Regexp can be changed into this: %r|(mms://{1}[\w\./-]+)|
If you'd like to avoid the weird m = nil scoping sorcery, this will also work, but is less idiomatic:
open("channels.xml").each do |line|
m = line.match(%r|(mms://{1}[\w\./-]+)|) and puts m
end
or the longer, but more readable version:
open("channels.xml").each do |line|
if m = line.match(%r|(mms://{1}[\w\./-]+)|)
puts m
end
end
One very easy to read approach is just to store the result of the match, then only print if there's a match:
File.open("channels.xml").each do |line|
m = line.match('(mms:\/\/{1}[a-zA-Z\.\d\/\w-]+)')
puts m if m
end
If you want to start getting clever (and have less-readable code), use $&, the global variable that holds the text matched by the last successful match:
File.open("channels.xml").each do |line|
puts $& if line.match('(mms:\/\/{1}[a-zA-Z\.\d\/\w-]+)')
end
Personally, I would probably just use the POSIX grep command. But there is Enumerable#grep in Ruby, too:
puts File.readlines('channels.xml').grep(%r|mms://{1}[\w\./-]+|)
Alternatively, you could use some of Ruby's file and line processing magic that it inherited from Perl. If you pass the -p flag to the Ruby interpreter, it will assume that the script you pass in is wrapped with while gets; ...; end and at the end of each loop it will print the current line. You can then use the $_ special variable to access the current line and use the next keyword to skip iteration of the loop if you don't want the line printed:
ruby -pe 'next unless $_ =~ %r|mms://{1}[\w\./-]+|' channels.xml
Basically,
ruby -pe 'next unless $_ =~ /re/' file
is equivalent to
grep -E re file
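Note that grep, in both the shell and the Enumerable form, prints whole matching lines, while the original script prints only the matched URL. A minimal sketch that keeps just the URL, using String#[] with a regex: the helper name mms_urls and the sample lines are made up, and the {1} quantifier is dropped because it repeats the preceding slash exactly once and so changes nothing.

```ruby
# Hypothetical helper: pull the first mms:// URL out of each line that has one.
def mms_urls(lines)
  lines.map { |line| line[%r{mms://[\w./-]+}] }.compact
end

# Made-up sample data standing in for channels.xml.
lines = [
  %(<channel url="mms://example.com/live/one.asf"/>\n),
  %(<channel name="no stream here"/>\n),
]
puts mms_urls(lines)
```

With a real file, File.foreach("channels.xml") could be passed in place of the array, since it returns an enumerator of lines when called without a block.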
