Is there a bug in Ruby lookbehind assertions (1.9/2.0)? - ruby

Why doesn't the regex (?<=fo).* match foo (whereas (?<=f).* does)?
"foo" =~ /(?<=f).*/m => 1
"foo" =~ /(?<=fo).*/m => nil
This only seems to happen with the /m modifier turned on (dot matches newline, which other regex flavors call single-line mode); without it, everything is OK:
"foo" =~ /(?<=f).*/ => 1
"foo" =~ /(?<=fo).*/ => 2
Tested on Ruby 1.9.3 and 2.0.0.
EDIT: Some more observations:
Adding an end-of-line anchor doesn't change anything:
"foo" =~ /(?<=fo).*$/m => nil
But together with a lazy quantifier, it "works":
"foo" =~ /(?<=fo).*?$/m => 2
EDIT: And some more observations:
.+ works, as does its equivalent .{1,}, but only in Ruby 1.9 (this seems to be the only behavioral difference between 1.9 and 2.0 in this scenario):
"foo" =~ /(?<=fo).+/m => 2
"foo" =~ /(?<=fo).{1,}/ => 2
In Ruby 2.0:
"foo" =~ /(?<=fo).+/m => nil
"foo" =~ /(?<=fo).{1,}/m => nil
.{0,} is busted (in both 1.9 and 2.0):
"foo" =~ /(?<=fo).{0,}/m => nil
But {n,m} works in both:
"foo" =~ /(?<=fo).{0,1}/m => 2
"foo" =~ /(?<=fo).{0,2}/m => 2
"foo" =~ /(?<=fo).{0,999}/m => 2
"foo" =~ /(?<=fo).{1,999}/m => 2

This has been officially classified as a bug and subsequently fixed, together with another problem concerning \Z anchors in multiline strings.
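To see whether a particular interpreter is still affected, the failing expression from the question can be wrapped in a small check. This is only a sketch: the method name lookbehind_bug? is made up for illustration, and the bounded quantifier is simply the form the question reports as working on both 1.9 and 2.0.

```ruby
# Returns true when the interpreter shows the faulty behaviour described
# in the question (nil instead of a match at index 2).
# The method name is hypothetical, chosen for this example only.
def lookbehind_bug?
  ("foo" =~ /(?<=fo).*/m).nil?
end

puts "Ruby #{RUBY_VERSION}: lookbehind bug present? #{lookbehind_bug?}"

# Bounded quantifier, reported in the question as working on 1.9 and 2.0.
bounded = ("foo" =~ /(?<=fo).{0,999}/m)
p bounded
```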

Related

How do I match something that is not a letter or a number or a space?

I'm using Ruby 2.4. How do I match something that is not a letter or a number or a space? I tried
2.4.0 :004 > str = "-"
=> "-"
2.4.0 :005 > str =~ /[^[:alnum:]]*/
=> 0
2.4.0 :006 > str = " "
=> " "
2.4.0 :007 > str =~ /[^[:alnum:]]*/
=> 0
but as you can see it is still matching a space.
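Your /[^[:alnum:]]*/ pattern matches 0 or more characters other than alphanumerics, so it matches whitespace too. Because of the * quantifier it can also match the empty string, which is why =~ returns 0 for any input.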
To match 1 or more chars other than alphanumeric and whitespace, you can use
/[^[:alnum:][:space:]]+/
Use the negated bracket expression with the relevant POSIX character classes inside.
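A quick way to compare the two patterns is to run them over a few sample strings. The variable names and sample strings below are illustrative and not from the question.

```ruby
# Original pattern: * allows an empty match, so it succeeds at index 0 everywhere.
loose = /[^[:alnum:]]*/
# Suggested pattern: at least one char that is neither alphanumeric nor whitespace.
strict = /[^[:alnum:][:space:]]+/

samples = ["-", " ", "abc 123", "a-b"]

loose_results  = samples.map { |s| s =~ loose }
strict_results = samples.map { |s| s =~ strict }

p loose_results   # [0, 0, 0, 0]
p strict_results  # [0, nil, nil, 1]
```

The strict pattern returns nil for the space and for plain alphanumeric text, and reports the index of the hyphen in "a-b".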

How do I keep the split token in the second part of what was split in Ruby?

In Ruby, how do you split a string and keep the token you are splitting on in the second part of the result? I have
line.split(/(?<=#{Regexp.escape(split_token)})/)
But the token is getting merged into the first part of the split and I want it in the second part:
2.4.0 :004 > split_token = "aaa"
=> "aaa"
2.4.0 :005 > line = "bbb aaa ccc"
=> "bbb aaa ccc"
2.4.0 :006 > line.split(/(?<=#{Regexp.escape(split_token)})/)
=> ["bbb aaa", " ccc"]
Changing lookbehind ((?<=) to lookahead ((?=) seems to do the trick:
split_token = "aaa"
line = "bbb aaa ccc"
line.split(/(?=#{Regexp.escape(split_token)})/)
# => ["bbb ", "aaa ccc"]
This just changes the split point to before the token rather than after it.
Another possibility is to use slice_before, which splits on whitespace first and so only works when the token is a whole word:
line.split.slice_before('aaa').map{|s| s.join(' ')}
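The three variants can be put side by side to see where each one places the token. This is a sketch using the values from the question; the result variable names are illustrative.

```ruby
split_token = "aaa"
line = "bbb aaa ccc"
escaped = Regexp.escape(split_token)

# Lookbehind: the split point is right after the token.
after_token = line.split(/(?<=#{escaped})/)
# Lookahead: the split point is right before the token.
before_token = line.split(/(?=#{escaped})/)
# slice_before works on whitespace-separated words, then rejoins them.
by_words = line.split.slice_before(split_token).map { |words| words.join(" ") }

p after_token   # ["bbb aaa", " ccc"]
p before_token  # ["bbb ", "aaa ccc"]
p by_words      # ["bbb", "aaa ccc"]
```

Note that the slice_before variant drops the whitespace around the split point, while the lookahead split keeps the original spacing.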

Chaining sed statements

I'm running a dozen sed commands for each Capistrano deploy and I was wondering if it's possible to chain them into a single sed command, instead of firing a dozen at the server.
task :taskname do
{:'foo' => foo, :'bar' => bar, :'foobar' => foobar, :'fubar' => fubar }.each do |search, replace|
run "sed -i 's/#{search}/#{replace}/' file.ext"
end
end
sed natively accepts multiple expressions through repeated -e options (if you for some reason prefer sed):
{:foo => foo, :bar => bar, :foobar => foobar, :fubar => fubar}.inject("") do |acc, k, v|
acc += " -e 's/#{k}/#{v}/'"
end
run "sed #{acc} file.ext"
Does mudasobwa's code (the previous answer) work? With my Ruby (v1.9.3), inject yields each hash entry as a single [key, value] pair, so it has to be:
acc = {:foo => foo, :bar => bar, :foobar => foobar, :fubar => fubar}.inject("") do |m, p|
m + " -e 's/#{p[0]}/#{p[1]}/'"
end
run "sed #{acc} file.ext"
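The same idea can be written with map and join, which avoids the accumulator altogether. This is a minimal sketch: the replacement values are placeholders standing in for the foo, bar, ... variables of the question, and the -i flag is carried over from the question's original command.

```ruby
# Hypothetical search/replace pairs; the values are placeholders.
replacements = { "foo" => "one", "bar" => "two", "foobar" => "three" }

# Build one -e expression per pair (block parameters destructure key and value).
expressions = replacements.map { |search, replace| "-e 's/#{search}/#{replace}/'" }

# Join everything into a single sed invocation.
command = "sed -i #{expressions.join(' ')} file.ext"

puts command
# sed -i -e 's/foo/one/' -e 's/bar/two/' -e 's/foobar/three/' file.ext
```

In a Capistrano task the resulting string would be passed to run. sed applies the expressions in order, so overlapping patterns such as foo and foobar need care, and values containing / or ' would have to be escaped first.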

Ruby 2.0 regex and cyrillic

Before ruby 2.0, regex worked this way:
/\A[a-zа-я\d]+\z/i =~ 'привет' # => 0
/\A[a-z\p{Cyrillic}\d]+\z/i =~ 'привет' # => 0
I updated to Ruby 2.0, and it has a bug:
/\A[a-zа-я\d]+\z/i =~ 'привет' # => nil
/\A[a-z\p{Cyrillic}\d]+\z/i =~ 'привет' # => nil
How can I deal with this problem? Without \d in the character class, it works correctly:
/\A[a-zа-я]+\z/i =~ 'привет' # => 0
This bug looks similar and may be related to this bug that I asked about before. I reported it to ruby trunk, and it has been accepted as a bug. Hopefully, it will be fixed.
The bug seems to be fixed in ruby-head:
⮀ rvm use ruby-2.0.0-preview2
Using /home/am/.rvm/gems/ruby-2.0.0-preview2
⮀ irb
2.0.0dev :001 > regex = /\A[a-zа-я\d]+\z/i ; regex =~ 'привет'
# ⇒ nil
⮀ rvm use ruby-2.0.0-preview1
Using /home/am/.rvm/gems/ruby-2.0.0-preview1
⮀ irb
2.0.0dev :001 > regex = /\A[a-zа-я\d]+\z/i ; regex =~ 'привет'
# ⇒ nil
⮀ rvm use ruby-head
Using /home/am/.rvm/gems/ruby-head
⮀ irb
irb(main):001:0> regex = /\A[a-zа-я\d]+\z/i ; regex =~ 'привет'
# ⇒ 0
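To repeat this check on whatever interpreter is at hand without switching by hand, the three patterns from the question can be run in one go. This is a sketch; the variable names are illustrative, and the output depends on the Ruby version, so none is shown.

```ruby
word = 'привет'

# The patterns from the question: two with \d in the class, one without.
patterns = [
  /\A[a-zа-я\d]+\z/i,
  /\A[a-z\p{Cyrillic}\d]+\z/i,
  /\A[a-zа-я]+\z/i
]

# 0 means the whole word matched; nil means the pattern failed.
results = patterns.map { |re| word =~ re }

patterns.zip(results).each do |re, result|
  puts "#{RUBY_VERSION}: #{re.inspect} => #{result.inspect}"
end
```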

Match newline `\n` in ruby regex

I'm trying to understand why the following returns false: (** I should have put "outputs 0" **)
puts "a\nb" =~ Regexp.new(Regexp.escape("a\nb"), Regexp::MULTILINE | Regexp::EXTENDED)
Perhaps someone could explain.
I am trying to generate a Regexp from a multi-line String that will match the String.
Thanks in advance
puts will always return nil.
Your code should work fine, albeit lengthy. =~ returns the position of the match which is 0.
You could also use:
"a\nb" =~ /a\sb/m
or
"a\nb" =~ /a\nb/m
Note: The m option isn't necessary in this example but demonstrates how it would be used without Regexp.new.
puts is probably the cause: it prints the 0 but itself returns nil.
1.9.3-194 (main):0 > puts ("a\nb" =~ Regexp.new(Regexp.escape("a\nb"), Regexp::MULTILINE | Regexp::EXTENDED) )
0
=> nil
1.9.3-194 (main):0 > "a\nb" =~ Regexp.new(Regexp.escape("a\nb"), Regexp::MULTILINE | Regexp::EXTENDED)
=> 0
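Separating the match from the printing makes the two return values easy to tell apart. A minimal sketch (variable names are illustrative) that builds the pattern from the string as in the question, without the extra flags:

```ruby
text = "a\nb"

# Regexp.escape turns the newline into the two characters \n,
# so the resulting pattern matches the original string literally.
pattern = Regexp.new(Regexp.escape(text))

index = (text =~ pattern)   # position of the match, or nil if there is none
printed = puts(index)       # prints 0; puts itself returns nil

p pattern.source  # "a\\nb"
p index           # 0
p printed         # nil
```

Since 0 is truthy in Ruby, the result of =~ can be used directly in a condition even when the match starts at the beginning of the string.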
