Named capture in Ruby's regular expressions - ruby

I am trying to extract information from a line of text with a relatively long regular expression. Below is a simplified regexp that describes the problem.
line = "Internet 10.9.68.178 127 c07b.bce9.7d41 ARPA Vlan2"
If I match this line directly, without saving the regexp in a variable, it works very well:
[223] pry(main)> /Internet\s+(?<ipaddr>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})/ =~ line
=> 0
[224] pry(main)> ipaddr
=> "10.9.68.178"
[225] pry(main)> $1
=> "10.9.68.178"
Now, when I try to do the exact same thing with a 'stored' version of the regexp, it fails miserably:
[226] pry(main)> ipaddr = nil # ensure that it's cleared before match
[227] pry(main)> myreg = /Internet\s+(?<ipaddr>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})/
=> /Internet\s+(?<ipaddr>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})/
[228] pry(main)> myreg =~ line
=> 0
[229] pry(main)> ipaddr
=> nil
[230] pry(main)> $1
=> "10.9.68.178"
I have also tried calling the match method directly, and it seems to work:
[231] pry(main)> myreg.match(line)
=> #<MatchData "Internet 10.9.68.178" ipaddr:"10.9.68.178">
but this means for a simple if statement I need to do something like this:
if m = myreg.match(line)
  do_stuff m[:ipaddr]
end
instead of simply
if myreg =~ line
  do_stuff ipaddr
end
Any ideas as to why the names are not captured correctly in this instance?

Interesting. I've looked this up in the Ruby Documentation.
It says there:
The assignment does not occur if the regexp is not a literal.
That's why /Internet\s+(?<ipaddr>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})/ =~ line works, but myreg =~ line does not.
Thanks for making me learn something new. :)
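A minimal sketch of two ways to keep the regexp in a variable and still read the named group, assuming the line format from the question. The constant ARP_ENTRY and both method names are made up for illustration; the pattern is the question's, slightly condensed.

```ruby
# Hypothetical names; the pattern follows the one in the question.
ARP_ENTRY = /Internet\s+(?<ipaddr>\d{1,3}(?:\.\d{1,3}){3})/

# Regexp#match returns a MatchData (truthy) or nil, so it fits in an if.
def ip_via_match(line)
  m = ARP_ENTRY.match(line)
  m && m[:ipaddr]
end

# =~ still sets $~ (the last MatchData) even though no local variable
# is assigned for a non-literal regexp.
def ip_via_last_match(line)
  return nil unless ARP_ENTRY =~ line
  $~[:ipaddr]
end

line = "Internet 10.9.68.178 127 c07b.bce9.7d41 ARPA Vlan2"
puts ip_via_match(line)
puts ip_via_last_match(line)
```

The second form keeps the short `if myreg =~ line` shape and only swaps the bare `ipaddr` for `$~[:ipaddr]`.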

Related

inserting variable value in regex in ruby script

I have a Ruby script that does pattern matching. My input string looks like this:
this.plugin = document.getElementById("pluginPlayer");
My regex looks like this:
regxPlayerVariable = '(.*?)=.*?document\.getElementById\("#{Regexp.escape(pluginPlayeVariable)}"\)'
Here pluginPlayeVariable is a variable, but the regex is not matching the input string.
If I change my regex and replace the variable with its value, it works fine, but I cannot do that because the value is only known at run time and changes accordingly.
I also tried another regex, shown below:
regxPlayerVariable = '(.*?)=.*?document\.getElementById\("#{pluginPlayeVariable}"\)'
So how can I solve this issue?
First of all, regxPlayerVariable is not a Regexp, it's a String. And the reason why your interpolation does not work is because you are using single quotes. Look:
foo = "bar"
puts '#{foo}' # => #{foo}
puts "#{foo}" # => bar
puts %q{#{foo}} # => #{foo}
puts %Q{#{foo}} # => bar
puts %{#{foo}} # => bar
puts /#{foo}/ # => (?-mix:bar)
puts %r{#{foo}} # => (?-mix:bar)
Only the last two are actually regular expressions, but here you can see which quoting expressions do interpolation, and which don't.
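Applied to the question, this could look like the sketch below. It assumes the run-time value is the element id; the method name player_assignment_regexp is hypothetical. A regexp literal interpolates like a double-quoted string, and Regexp.escape keeps characters such as "." in the value from being read as metacharacters.

```ruby
# Hypothetical helper: builds the pattern from a run-time element id.
def player_assignment_regexp(element_id)
  /(.*?)=.*?document\.getElementById\("#{Regexp.escape(element_id)}"\)/
end

input = 'this.plugin = document.getElementById("pluginPlayer");'
m = player_assignment_regexp("pluginPlayer").match(input)
puts m[1].strip if m
```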

How to convert a backslash hexadecimal string to a binary string in Ruby? [duplicate]

Does Ruby have any built-in method for escaping and unescaping strings? In the past, I've used regular expressions; however, it occurs to me that Ruby probably does such conversions internally all the time. Perhaps this functionality is exposed somewhere.
So far I've come up with these functions. They work, but they seem a bit hacky:
def escape(s)
  s.inspect[1..-2]
end
def unescape(s)
  eval %Q{"#{s}"}
end
Is there a better way?
Ruby 2.5 added String#undump as a complement to String#dump:
$ irb
irb(main):001:0> dumped_newline = "\n".dump
=> "\"\\n\""
irb(main):002:0> undumped_newline = dumped_newline.undump
=> "\n"
With it:
def escape(s)
  s.dump[1..-2]
end
def unescape(s)
  "\"#{s}\"".undump
end
$ irb
irb(main):001:0> escape("\n \" \\")
=> "\\n \\\" \\\\"
irb(main):002:0> unescape("\\n \\\" \\\\")
=> "\n \" \\"
There are a number of escaping methods, each aimed at a different target format. Some of them (note that URI.escape was removed in Ruby 3.0):
# Regexp escapings
>> Regexp.escape('\*?{}.')
=> \\\*\?\{\}\.
>> URI.escape("test=100%")
=> "test=100%25"
>> CGI.escape("test=100%")
=> "test%3D100%25"
So it really depends on the problem you need to solve. But I would avoid using inspect for escaping.
Update: there is also String#dump, which behaves much like inspect, and it looks like it is what you need:
>> "\n\t".dump
=> "\"\\n\\t\""
Caleb's function was the nearest thing to the reverse of String#inspect I was able to find; however, it contained two bugs:
\\ was not handled correctly.
\x.. retained the backslash.
I fixed the above bugs and this is the updated version:
UNESCAPES = {
  'a' => "\x07", 'b' => "\x08", 't' => "\x09",
  'n' => "\x0a", 'v' => "\x0b", 'f' => "\x0c",
  'r' => "\x0d", 'e' => "\x1b", "\\\\" => "\x5c",
  "\"" => "\x22", "'" => "\x27"
}
def unescape(str)
  # Escape all the things
  str.gsub(/\\(?:([#{UNESCAPES.keys.join}])|u([\da-fA-F]{4}))|\\0?x([\da-fA-F]{2})/) {
    if $1
      if $1 == '\\' then '\\' else UNESCAPES[$1] end
    elsif $2 # escape \u0000 unicode
      ["#$2".hex].pack('U*')
    elsif $3 # escape \0xff or \xff
      [$3].pack('H2')
    end
  }
end
# To test it (the loop stops at end of input, when gets returns nil)
while (line = STDIN.gets)
  puts unescape(line)
end
Update: I no longer agree with my own answer, but I'd prefer not to delete it, since I suspect that others may go down this wrong path. There has already been a lot of discussion of this answer and its alternatives, so I think it still contributes to the conversation. Please don't use this answer in real code.
If you don't want to use eval, but are willing to use the YAML module, you can use it instead:
require 'yaml'
def unescape(s)
  YAML.load(%Q(---\n"#{s}"\n))
end
The advantage to YAML over eval is that it is presumably safer. cane disallows all usage of eval. I've seen recommendations to use $SAFE along with eval, but that is not available via JRuby currently.
For what it is worth, Python does have native support for unescaping backslashes.
Ruby's inspect can help:
"a\nb".inspect
=> "\"a\\nb\""
Normally if we print a string with an embedded line-feed, we'd get:
puts "a\nb"
a
b
If we print the inspected version:
puts "a\nb".inspect
"a\nb"
Assign the inspected version to a variable and you'll have the escaped version of the string.
To undo the escaping, eval the string:
puts eval("a\nb".inspect)
a
b
I don't really like doing it this way. It's more of a curiosity than something I'd do in practice.
YAML's ::unescape doesn't seem to escape quote characters, e.g. ' and ". I'm guessing this is by design, but it makes me sad.
You definitely do not want to use eval on arbitrary or client-supplied data.
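As a rough illustration of an eval-free alternative for untrusted input, here is a sketch built on String#undump, assuming Ruby 2.5 or later. The method name safe_unescape is made up for this example. undump only decodes backslash escapes and raises RuntimeError on input that is not a well-formed dumped string, so an attempt to break out of the quotes is rejected instead of being executed.

```ruby
# Hypothetical helper: returns the unescaped string, or nil if the input
# is not a valid body of a dumped string. Requires Ruby 2.5+ for undump.
def safe_unescape(s)
  "\"#{s}\"".undump
rescue RuntimeError
  nil
end

p safe_unescape('a\nb')          # backslash + n becomes a real newline
p safe_unescape('x"; exit; "')   # not a valid dumped string
```

Returning nil on bad input is only one possible design; re-raising with a clearer error may suit callers that need to distinguish malformed data.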
This is what I use. Handles everything I've seen and doesn't introduce any dependencies.
UNESCAPES = {
  'a' => "\x07", 'b' => "\x08", 't' => "\x09",
  'n' => "\x0a", 'v' => "\x0b", 'f' => "\x0c",
  'r' => "\x0d", 'e' => "\x1b", "\\\\" => "\x5c",
  "\"" => "\x22", "'" => "\x27"
}
def unescape(str)
  # Escape all the things
  str.gsub(/\\(?:([#{UNESCAPES.keys.join}])|u([\da-fA-F]{4}))|\\0?x([\da-fA-F]{2})/) {
    if $1
      if $1 == '\\' then '\\' else UNESCAPES[$1] end
    elsif $2 # escape \u0000 unicode
      ["#$2".hex].pack('U*')
    elsif $3 # escape \0xff or \xff
      [$3].pack('H2')
    end
  }
end
I suspect that Shellwords.escape will do what you're looking for
https://ruby-doc.org/stdlib-1.9.3/libdoc/shellwords/rdoc/Shellwords.html#method-c-shellescape

Why does capturing named groups in Ruby result in "undefined local variable or method" errors?

I am having trouble with named captures in regular expressions in Ruby 2.0. I have a string variable and an interpolated regular expression:
str = "hello world"
re = /\w+/
/(?<greeting>#{re})/ =~ str
greeting
It raises the following exception:
prova.rb:4:in `<main>': undefined local variable or method `greeting' for main:Object (NameError)
shell returned 1
However, the interpolated expression works without named captures. For example:
/(#{re})/ =~ str
$1
# => "hello"
Named Captures Must Use Literals
You are encountering some limitations of Ruby's regular expression library. The Regexp#=~ method limits named captures as follows:
The assignment does not occur if the regexp is not a literal.
A regexp interpolation, #{}, also disables the assignment.
The assignment does not occur if the regexp is placed on the right hand side.
You'll need to decide whether you want named captures or interpolation in your regular expressions. You currently cannot have both.
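The three rules can be checked with a small sketch like the one below. The method names are hypothetical, and Binding#local_variable_defined? assumes Ruby 2.1 or later. Note that in every case the match itself succeeds and $~ still carries the named group; only the automatic local-variable assignment is skipped.

```ruby
# Literal on the left: the parser creates the local variable `greeting`.
def literal_on_left(str)
  /(?<greeting>\w+)/ =~ str
  greeting
end

# Literal on the right: no local variable is created.
def literal_on_right(str)
  str =~ /(?<greeting>\w+)/
  binding.local_variable_defined?(:greeting)
end

# Interpolated literal: no local variable either, but $~ has the group.
def interpolated(str)
  word = /\w+/
  /(?<greeting>#{word})/ =~ str
  [binding.local_variable_defined?(:greeting), $~[:greeting]]
end

p literal_on_left("hello world")
p literal_on_right("hello world")
p interpolated("hello world")
```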
Assign the result of #match; the returned MatchData can be indexed like a hash to look up your named capture groups:
> matches = "hello world".match(/(?<greeting>\w+)/)
=> #<MatchData "hello" greeting:"hello">
> matches[:greeting]
=> "hello"
Alternately, give #match a block, which will receive the match results:
> "hello world".match(/(?<greeting>\w+)/) {|matches| matches[:greeting] }
=> "hello"
As an addendum to both answers in order to make it crystal clear:
str = "hello world"
# => "hello world"
re = /\w+/
# => /\w+/
re2 = /(?<greeting>#{re})/
# => /(?<greeting>(?-mix:\w+))/
md = re2.match str
# => #<MatchData "hello" greeting:"hello">
md[:greeting]
# => "hello"
Interpolation is fine with named captures, just use the MatchData object, most easily returned via match.

Couldn't understand why the Regexp option i got disabled in my code

I have just started playing with Ruby and I'm stuck on something. Is
there some trick to modify the casefold attribute of a Regexp object after
it's been instantiated?
The best idea I have tried is the following:
irb(main):001:0> a = Regexp.new('a')
=> /a/
irb(main):002:0> aA = Regexp.new(a.to_s, Regexp::IGNORECASE)
=> /(?-mix:a)/i
But none of the below seems to work:
irb(main):003:0> a =~ 'a'
=> 0
irb(main):004:0> a =~ 'A'
=> nil
irb(main):005:0> aA =~ 'a'
=> 0
irb(main):006:0> aA =~ 'A'
=> nil
Something I don't understand is happening here. Where did the 'i' go on line
8?
irb(main):07:0> aA = Regexp.new(a.to_s, Regexp::IGNORECASE)
=> /(?-mix:a)/i
irb(main):08:0> aA.to_s
=> "(?-mix:a)"
irb(main):09:0>
I am using Ruby 1.9.3.
I am also unable to understand why the code below returns false:
/(?i:a)/.casefold? #=> false
As your console output shows, a.to_s includes the case sensitivity as an option for your subexpression, so aA is being defined as
/(?-mix:a)/i
So you're asking Ruby for a regular expression that is case insensitive, but the only thing in that case-insensitive regexp is a group in which case sensitivity has been turned back on, so the net effect is that 'a' is matched case sensitively.
Since the result of to_s is just the regular expression string itself - no delimiters or external flags - the flags are translated into the (?i:...) syntax that sets or clears them temporarily inside the expression itself. This lets you get a Regexp object back out via a simple Regexp.new(s) call that will match the same strings.
The wrapping, unfortunately, includes explicitly clearing the flags that are not set on the object. So your first regex gets stringified into (?-mix:a); that is, the casefold option is explicitly turned off between the parentheses. Turning it back on for the object doesn't have any effect.
You can use a.source instead of a.to_s to get just the original expression, without the flag settings:
irb(main):001:0> a=/a/
=> /a/
irb(main):002:0> aA = Regexp.new(a.source, Regexp::IGNORECASE)
=> /a/i
irb(main):003:0> a =~ 'a'
=> 0
irb(main):004:0> a =~ 'A'
=> nil
irb(main):005:0> aA =~ 'a'
=> 0
irb(main):006:0> aA =~ 'A'
=> 0
As Frederick already explains, calling to_s on a regex will add modifiers around it that ensure that its properties like case-sensitiveness are preserved. So if you insert a case-sensitive regex into a case-insensitive regex, the inserted part will still be case-sensitive. Likewise the modifiers given to Regexp.new will have no effect if the first argument is a regex or the result of calling to_s on one.
To solve this issue, call source on the regex instead of to_s. Unlike to_s, source simply returns the source of the regex without adding anything:
aA = Regexp.new(a.source, Regexp::IGNORECASE)
I am also unable to understand why the code below returns false:
/(?i:a)/.casefold?
Because (?i:...) sets the i flag locally, not globally. It only applies to the part of the regex within the parentheses, not the whole regex. Of course in this case the whole regex is within the parentheses, but that doesn't matter as far as methods like casefold? are concerned.
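Putting the two answers together, a small helper along these lines could add the i flag to an existing Regexp. This is only a sketch: the name with_ignorecase is hypothetical, and OR-ing in the existing options (so that flags such as m survive) is an addition not discussed above.

```ruby
# Hypothetical helper: rebuilds a regexp from its source with IGNORECASE
# added, keeping whatever options the original already had.
def with_ignorecase(regexp)
  Regexp.new(regexp.source, regexp.options | Regexp::IGNORECASE)
end

a = /a/
wrapped = Regexp.new(a.to_s, Regexp::IGNORECASE) # inner (?-mix:...) wins
rebuilt = with_ignorecase(a)

p wrapped =~ "A"
p rebuilt =~ "A"
p rebuilt.casefold?
p(/(?i:a)/.casefold?) # local flag only, so the object-level flag is unset
```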

Ruby 1.9.3 regular expressions with gsub: Bugs or features?

Take this snippet of code, which is supposed to replace an <a href> tag with its URL:
irb> s='<p><a href="http://localhost/activate/57f7e805827f">Click here!</a></p>'
irb> s.gsub(/<a href="([^ '"]*)"([^>]*)?>([^<]*)<\/a>/, "#{$1}")
=> "<p></p>"
This regex fails (URL is not found). Then I escape the < character in the regex, and it works:
irb> s.gsub(/<a href="([^ '"]*)"([^>]*)?>([^\<]*)<\/a>/, "#{$1}")
=> "<p>http://localhost/activate/57f7e805827f</p>"
1: According to RubyMine's inspections, this escape should not be necessary. Is this correct? If so, why is the escape of > apparently not necessary as well?
2: Afterwards in the same IRB session, with the same string, the original regex suddenly works too:
irb> s.gsub(/<a href="([^ '"]*)"([^>]*)?>([^<]*)<\/a>/, "#{$1}")
=> "<p>http://localhost/activate/57f7e805827f</p>"
Is this because the $1 variable is not cleared when calling gsub again? If so, is it intentional behaviour or is this a Ruby regex bug?
3: When I change the string, and reexecute the same command, $1 will only change after calling gsub twice on the changed string:
irb> s='<p><a href="http://localhost/activate/xxxxyyy">Click here!</a></p>'
=> "<p><a href=\"http://localhost/activate/xxxxyyy\">Click here!</a></p>"
irb> s.gsub(/<a href="([^ '"]*)"([^>]*)?>([^\<]*)<\/a>/, "#{$1}")
=> "<p>http://localhost/activate/57f7e805827f</p>"
irb> s.gsub(/<a href="([^ '"]*)"([^>]*)?>([^\<]*)<\/a>/, "#{$1}")
=> "<p>http://localhost/activate/xxxxyyy</p>"
Is this intentional? If so, what is the logic behind this?
4: As the replacement string, some tutorials suggest using "#{$n}", others suggest using '\n' (where n is the group number). With the backslash variant, the problems above do not appear. Why? What is the difference between the two?
Thank you!
$1 contains the first capture of the last match. In your example, it is evaluated before the matching (actually even before gsub is called), therefore the value of $1 is fixed to nil (because you have not matched anything yet). So you always get the first capture of the previous match; you do not even need to change your original regex to get the expected result the second time:
s='<p><a href="http://localhost/activate/57f7e805827f">Click here!</a></p>'
s.gsub(/<a href="([^ '"]*)"([^>]*)?>([^<]*)<\/a>/, "#{$1}")
# => "<p></p>"
s.gsub(/<a href="([^ '"]*)"([^>]*)?>([^<]*)<\/a>/, "#{$1}")
# => "<p>http://localhost/activate/57f7e805827f</p>"
You can pass a block to gsub, though, which is evaluated after the matching, e.g.
s.gsub(/<a href="([^ '"]*)"([^>]*)?>([^<]*)<\/a>/){ $1 }
# => "<p>http://localhost/activate/57f7e805827f</p>"
This way, $1 behaves as you'd expect. I like to always use named captures so I don't have to keep track of the numbers when I add a capture, though:
s.gsub(/<a href="(?<href>([^ '"]*))"([^>]*)?>([^<]*)<\/a>/){ $~[:href] }
# => "<p>http://localhost/activate/57f7e805827f</p>"
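On the fourth question, the sketch below contrasts the variants. It assumes the input contains an anchor of the shape the regex expects; the constant LINK and the method names are made up, and the pattern is a condensed form of the one in the question. A '\1' replacement is plain text that gsub itself expands for each match, whereas "#{$1}" is interpolated by Ruby before gsub is even called.

```ruby
# Hypothetical names; condensed version of the question's pattern.
LINK = /<a href="([^ '"]*)"[^>]*>[^<]*<\/a>/

# '\1' is expanded by gsub itself, once per match.
def links_to_urls_backref(html)
  html.gsub(LINK, '\1')
end

# The block runs after each match, so $1 refers to the current match.
def links_to_urls_block(html)
  html.gsub(LINK) { $1 }
end

# "#{$1}" is built before gsub runs. No match has happened in this
# method yet, so $1 is nil and the replacement is the empty string.
def links_to_urls_stale(html)
  html.gsub(LINK, "#{$1}")
end

html = '<p><a href="http://localhost/activate/57f7e805827f">Click here!</a></p>'
puts links_to_urls_backref(html)
puts links_to_urls_block(html)
puts links_to_urls_stale(html)
```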

Resources