Matching with a regex that is stored in a hash - ruby

Here is some Ruby code that shows the issue:
1.8.7 :018 > pattern[:key] = '554 5.7.1.*Service unavailable; Helo command .* blocked using'
=> "554 5.7.1.*Service unavailable; Helo command .* blocked using"
1.8.7 :019 > line = '554 5.7.1 Service unavailable; Helo command [abc] blocked using dbl'
=> "554 5.7.1 Service unavailable; Helo command [abc] blocked using dbl"
1.8.7 :020 > line =~ /554 5.7.1.*Service unavailable; Helo command .* blocked using/
=> 0
1.8.7 :021 > line =~ /pattern[:key]/
=> nil
1.8.7 :022 >
The regex works when I write it directly as a literal, but it doesn't work when I use the value stored in the hash.

This isn't a Ruby question per se; it's about how to construct a regex pattern that accomplishes what you want.
In "regex-ese", /pattern[:key]/ means:
Find the literal text pattern.
After pattern, look for one of :, k, e or y.
Ruby doesn't automatically interpolate variables in strings or regex patterns like Perl does, so we have to mark where the variable goes using #{...} inline.
If pattern[:key] is the entire pattern, don't bother interpolating it into a regex literal. Instead, take the direct path and let Regexp do it for you:
pattern[:key] = 'foo'
Regexp.new(pattern[:key])
=> /foo/
Which is the same result as:
/#{pattern[:key]}/
=> /foo/
but without going through string interpolation first.
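A minimal sketch of the difference, reusing the pattern and line from the question. The variable names literal, interpolated and constructed are only there for illustration:

```ruby
pattern = { :key => '554 5.7.1.*Service unavailable; Helo command .* blocked using' }
line    = '554 5.7.1 Service unavailable; Helo command [abc] blocked using dbl'

# Looks for the literal text "pattern" followed by one of :, k, e, y.
literal = line =~ /pattern[:key]/

# Both of these build the regex from the string stored in the hash.
interpolated = line =~ /#{pattern[:key]}/
constructed  = line =~ Regexp.new(pattern[:key])

p literal       # nil
p interpolated  # 0
p constructed   # 0
```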
Another of your attempts used ., [ and ], which are reserved characters in patterns, used to help define patterns. If you need to use such characters, you can have Ruby's Regexp.escape add \ escape characters appropriately, preserving their normal/literal meaning in the string:
Regexp.escape('5.7.1 [abc]')
=> "5\\.7\\.1\\ \\[abc\\]"
which is really the string 5\.7\.1\ \[abc\]; IRB's inspect output doubles each backslash.
To use that in a regex, use:
Regexp.new(Regexp.escape('5.7.1 [abc]'))
=> /5\.7\.1\ \[abc\]/
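To see what the escaping buys you, here is a small sketch comparing an escaped and an unescaped pattern built from the same text. The look-alike string '5x7y1 a' is made up for the example:

```ruby
text = '5.7.1 [abc]'

raw     = Regexp.new(text)                 # . and [abc] keep their regex meaning
escaped = Regexp.new(Regexp.escape(text))  # every character is matched literally

raw_hits_lookalike     = !('5x7y1 a' =~ raw).nil?
raw_hits_original      = !(text =~ raw).nil?
escaped_hits_lookalike = !('5x7y1 a' =~ escaped).nil?
escaped_hits_original  = !(text =~ escaped).nil?

p raw_hits_lookalike      # true
p raw_hits_original       # false
p escaped_hits_lookalike  # false
p escaped_hits_original   # true
```

Note that the unescaped pattern does not even match its own source text, because [abc] is read as a character class rather than as literal brackets.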

line =~ /#{pattern[:key]}/
or...
line =~ Regexp.new pattern[:key]
If you want to escape regex special characters...
line =~ /#{Regexp.quote pattern[:key]}/
Edit: Since you're new to ruby, may I suggest you do this, wherever pattern is defined:
pattern[:key] = Regexp.new '554 5.7.1.*Service unavailable; Helo command .* blocked using'
Then you can simply use the Regexp object stored in pattern:
line =~ pattern[:key]
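Storing Regexp objects pays off once there are several patterns to try. A rough sketch, assuming a lookup table like the one below; the PATTERNS constant, the classify helper and the :mailbox_full entry are hypothetical, and only the :helo_blocked pattern comes from the question:

```ruby
# Hypothetical table of precompiled patterns.
PATTERNS = {
  :helo_blocked => Regexp.new('554 5.7.1.*Service unavailable; Helo command .* blocked using'),
  :mailbox_full => /552 .*mailbox full/i
}

# Returns the key of the first pattern that matches the line, or nil.
def classify(line, patterns = PATTERNS)
  patterns.each do |name, regex|
    return name if line =~ regex
  end
  nil
end

p classify('554 5.7.1 Service unavailable; Helo command [abc] blocked using dbl')  # :helo_blocked
p classify('250 OK')                                                              # nil
```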

Related

Why this regular expression matches /\w+[^(]/?

>> 'hola(' =~ /\w+[^(]/
=> 0
>> $&
=> "hola"
As far as I know, /\w+[^(]/ should match a word not followed by a (. I also tried with negative look-behind and escaping the (; with the same results.
What I find strange about this is that if I try with /a[^(]/, it works (as expected).
>> 'hola(' =~ /a[^(]/
=> nil
So it definitely has something to do with the + quantifier.
What's happening?
I tried using Ruby 2.2 and Python 3.3
The hol portion matches the \w+ portion of the expression, and the [^(] portion matches a. The ( part of the input is ignored.
Add $ to fix the problem:
>> 'hola(' =~ /\w+[^(]$/
=> nil
In this example, the regex engine works like this: \w+ first consumes all of hola, but the next character, (, doesn't match [^(]. The engine then backtracks and finds that the pattern matches when \w+ takes only hol and [^(] takes a.
You can inhibit the backtracking with an atomic group, (?>re):
'hola(' =~ /(?>\w+)[^(]/
# => nil
or with the possessive ++ repetition:
'hola(' =~ /\w++[^(]/
# => nil
Both are Ruby Regexp extensions.
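Adding capture groups makes the backtracking visible. This is only an illustrative sketch; the groups and variable names are not part of the original pattern:

```ruby
# Same pattern as in the question, with each part wrapped in a capture group.
m = 'hola('.match(/(\w+)([^(])/)

word_part = m[1]  # what \w+ ended up with after backtracking
next_char = m[2]  # what [^(] matched

# With an atomic group, \w+ keeps all of "hola" and the match fails.
atomic = 'hola(' =~ /(?>\w+)[^(]/

p word_part  # "hol"
p next_char  # "a"
p atomic     # nil
```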
Actually \w+[^(] matches one or more word-characters, followed by any character that isn't a (.
Which means it will match hola and ignore the last character.
Depending on where you want to use this, a pattern like \w+\b(?!\() might be more suitable, as it will match any word not followed by (.
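A quick check of that look-ahead pattern on a made-up input (the NOT_CALLED name and the sample string are assumptions for the example):

```ruby
# A word that ends at a word boundary and is not followed by "(".
NOT_CALLED = /\w+\b(?!\()/

words = 'foo( bar baz('.scan(NOT_CALLED)
single = 'hola(' =~ NOT_CALLED

p words   # ["bar"]
p single  # nil
```

The \b matters here: without it the engine could backtrack to a shorter prefix of the word, which is the same effect described above.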

Replacing regex capture with the same capture and an extra string

I am trying to escape certain characters in a string. In particular, I want to turn
abc/def.ghi into abc\/def\.ghi
I tried to use the following syntax:
1.9.3p125 :076 > "abc/def.ghi".gsub(/([\/.])/, '\\\1')
=> "abc\\1def\\1ghi"
Hmm. This behaves as if capture replacements didn't work. Yet, when I tried this:
1.9.3p125 :075 > "abc/def.ghi".gsub(/([\/.])/, '\1')
=> "abc/def.ghi"
... I got the replacement to work, but, of course, my prefixes weren't part of it.
What is the correct syntax to do something like this?
This should be easier
gsub(/(?=[.\/])/, "\\")
If you are trying to prepare a string to be used as a regex pattern, use the right tool:
Regexp.escape('abc/def.ghi')
=> "abc/def\\.ghi"
You can then use the resulting string to create a regex:
/#{ Regexp.escape('abc/def.ghi') }/
=> /abc\/def\.ghi/
or:
Regexp.new(Regexp.escape('abc/def.ghi'))
=> /abc\/def\.ghi/
From the docs:
Escapes any characters that would have special meaning in a regular expression. Returns a new escaped string, or self if no characters are escaped. For any string, Regexp.new(Regexp.escape(str))=~str will be true.
Regexp.escape('\*?{}.') #=> \\\*\?\{\}\.
You can pass a block to gsub:
>> "abc/def.ghi".gsub(/([\/.])/) {|m| "\\#{m}"}
=> "abc\\/def\\.ghi"
Not nearly as elegant as @sawa's answer, but it was the only way I could find to get it to work if you need the replacing string to contain the captured group/backreference (rather than inserting the replacement before the look-ahead).
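For completeness, a backreference in the replacement string does work once the backslashes are doubled up. The string '\\\1' from the question is three characters (\, \, 1), which gsub reads as an escaped backslash followed by a literal 1. A sketch of two working variants, with variable names chosen for the example:

```ruby
input = 'abc/def.ghi'

# '\\\\\1' is the four-character string \\\1:
# gsub reads \\ as a literal backslash and \1 as the first capture.
with_backref = input.gsub(/([\/.])/, '\\\\\1')

# The block form avoids the second round of unescaping entirely.
with_block = input.gsub(/[\/.]/) { |m| "\\#{m}" }

puts with_backref  # abc\/def\.ghi
puts with_block    # abc\/def\.ghi
```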

Ruby 1.9.3 regular expressions with gsub: Bugs or features?

Take this snippet of code, which is supposed to replace an <a href> tag with its URL:
irb> s='<p><a href="http://localhost/activate/57f7e805827f">Click here!</a></p>'
irb> s.gsub(/<a href="([^ '"]*)"([^>]*)?>([^<]*)<\/a>/, "#{$1}")
=> "<p></p>"
This regex fails (URL is not found). Then I escape the < character in the regex, and it works:
irb> s.gsub(/<a href="([^ '"]*)"([^>]*)?>([^\<]*)<\/a>/, "#{$1}")
=> "<p>http://localhost/activate/57f7e805827f</p>"
1: According to RubyMine's inspections, this escape should not be necessary. Is this correct? If so, why is the escape of > apparently not necessary as well?
2: Afterwards in the same IRB session, with the same string, the original regex suddenly works too:
irb> s.gsub(/<a href="([^ '"]*)"([^>]*)?>([^<]*)<\/a>/, "#{$1}")
=> "<p>http://localhost/activate/57f7e805827f</p>"
Is this because the $1 variable is not cleared when calling gsub again? If so, is it intentional behaviour or is this a Ruby regex bug?
3: When I change the string, and reexecute the same command, $1 will only change after calling gsub twice on the changed string:
irb> s='<p><a href="http://localhost/activate/xxxxyyy">Click here!</a></p>'
=> "<p><a href=\"http://localhost/activate/xxxxyyy\">Click here!</a></p>"
irb> s.gsub(/<a href="([^ '"]*)"([^>]*)?>([^\<]*)<\/a>/, "#{$1}")
=> "<p>http://localhost/activate/57f7e805827f</p>"
irb> s.gsub(/<a href="([^ '"]*)"([^>]*)?>([^\<]*)<\/a>/, "#{$1}")
=> "<p>http://localhost/activate/xxxxyyy</p>"
Is this intentional? If so, what is the logic behind this?
4: For the replacement string, some tutorials suggest using "#{$n}", others suggest using '\n'. With the backslash variant, the problems above do not appear. Why? What is the difference between the two?
Thank you!
$1 contains the first capture of the last match. In your example, it is evaluated before the matching (actually even before gsub is called), so its value is nil the first time, because nothing has been matched yet. You therefore always get the first capture of the previous match; you do not even need to change your original regex to get the expected result the second time:
s='<p><a href="http://localhost/activate/57f7e805827f">Click here!</a></p>'
s.gsub(/<a href="([^ '"]*)"([^>]*)?>([^<]*)<\/a>/, "#{$1}")
# => "<p></p>"
s.gsub(/<a href="([^ '"]*)"([^>]*)?>([^<]*)<\/a>/, "#{$1}")
# => "<p>http://localhost/activate/57f7e805827f</p>"
You can pass a block to gsub though, which is evaluated after the matching, e.g.:
s.gsub(/<a href="([^ '"]*)"([^>]*)?>([^<]*)<\/a>/){ $1 }
# => "<p>http://localhost/activate/57f7e805827f</p>"
This way, $1 behaves as you'd expect. I like to always use named captures so I don't have to keep track of the numbers when I add a capture, though:
s.gsub(/<a href="(?<href>([^ '"]*))"([^>]*)?>([^<]*)<\/a>/){ $~[:href] }
# => "<p>http://localhost/activate/57f7e805827f</p>"
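As for the difference between "#{$1}" and '\1': the first is ordinary string interpolation, finished before gsub runs, while the second is left for gsub to expand per match. A sketch, assuming the input contained an anchor tag like the one below (the tag is reconstructed from the output shown in the question; LINK and eager are names invented for the example):

```ruby
html = '<p><a href="http://localhost/activate/57f7e805827f">Click here!</a></p>'
LINK = /<a href="([^ '"]*)"([^>]*)?>([^<]*)<\/a>/

# "#{$1}" is interpolated before gsub is called. Inside a fresh method
# call there is no previous match, so $1 is nil and the replacement is "".
def eager(html)
  html.gsub(LINK, "#{$1}")
end

# '\1' is expanded by gsub itself, once per match.
with_backref = html.gsub(LINK, '\1')

# The block runs after each match, so $1 refers to the current match.
with_block = html.gsub(LINK) { $1 }

p eager(html)   # "<p></p>"
p with_backref  # "<p>http://localhost/activate/57f7e805827f</p>"
p with_block    # "<p>http://localhost/activate/57f7e805827f</p>"
```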

Stumped by a simple regex

I am trying to see if the string s contains any of the symbols in a regex. The regex below works fine on rubular.
s = "asd#d"
s =~ /[~!@#$%^&*()]+/
But in Ruby 1.9.2, it gives this error message:
syntax error, unexpected ']', expecting tCOLON2 or '[' or '.'
s = "asd#d"; s =~ /[~!@#$%^&*()]/
What is wrong?
This is actually a special case of string interpolation with global and instance variables that most seem not to know about. Since string interpolation also occurs within regex in Ruby, I'll illustrate below with strings (since they provide for an easier example):
@foo = "instancefoo"
$foo = "globalfoo"
"#@foo" # => "instancefoo"
"#$foo" # => "globalfoo"
Thus you need to escape the # to prevent it from being interpolated:
/[~!@\#$%^&*()]+/
The only way that I know of to create a non-interpolated regex in Ruby is from a string (note single quotes):
Regexp.new('[~!@#$%^&*()]+')
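A short sketch of both workarounds, assuming the intended character class is the keyboard symbol row ~!@#$%^&*() (variable names are illustrative):

```ruby
s = 'asd#d'

# A single-quoted string is never interpolated, so #$ is just two characters.
from_string = Regexp.new('[~!@#$%^&*()]+')

# In a regex literal, escaping the # keeps the lexer from seeing #$ as interpolation.
escaped_hash = /[~!@\#$%^&*()]+/

p s =~ from_string   # 3
p s =~ escaped_hash  # 3
p 'asdf' =~ from_string  # nil
```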
I was able to replicate this behavior in 1.9.3p0. Apparently there is a problem with the '#$' combination. If you escape either it works. If you reverse them it works:
s =~ /[~!@$#%^&*()]+/
Edit: in Ruby 1.9 #$ invokes variable interpolation, even when followed by a % which is not a valid variable name.
I disagree; you need to escape the $, since it's the end-of-string character.
s =~ /[~!@#\$%^&*()]/ # => 3
That is correct.

How do I remove a non-breaking space in Ruby

I have a string that looks like this:
d = "foo\u00A0\bar"
When I check the length, it says that it is 7 characters long. I checked online and found out that it is a non-breaking space. Could someone show me how to remove all the non-breaking spaces in a string?
In case you do not care about the non-breaking space specifically, but about any "special" unicode whitespace character that might appear in your string, you can replace it using the POSIX bracket expression for whitespace:
s.gsub(/[[:space:]]/, '')
These bracket expressions (as opposed to matchers like \s) match not only ASCII characters but all Unicode characters of the class.
For more details, see the Ruby documentation.
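A small comparison of \s and [[:space:]] on a UTF-8 string. The sample string drops the \b from the question's example to keep the output readable:

```ruby
d = "foo\u00A0bar"  # U+00A0 is the non-breaking space

ascii_only = d.gsub(/\s/, '')           # \s covers ASCII whitespace only
posix      = d.gsub(/[[:space:]]/, '')  # [[:space:]] also covers U+00A0

p d.length           # 7
p ascii_only.length  # 7
p posix              # "foobar"
```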
irb(main):001:0> d = "foo\u00A0\bar"
=> "foo \bar"
irb(main):002:0> d.gsub("\u00A0", "")
=> "foo\bar"
It's an old thread but maybe it helps somebody.
I found myself looking for a solution to the same problem when I discovered that strip doesn't do the job. I checked with method ord what the character was and used chr to represent it in gsub
2.2.3 :010 > 160.chr("UTF-8")
=> " "
2.2.3 :011 > 160.chr("UTF-8").strip
=> " "
2.2.3 :012 > nbsp = 160.chr("UTF-8")
=> " "
2.2.3 :013 > nbsp.gsub(160.chr("UTF-8"),"")
=> ""
I couldn't understand why strip doesn't remove something that looked like a space to me, so I looked up what character code 160 actually is.
d.gsub("\u00A0", "") does not work in Ruby 1.8. Instead use d.gsub(/\302\240/,"")
See http://blog.grayproductions.net/articles/understanding_m17n for lots more on the character encoding differences between 1.8 and 1.9.
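The \302\240 in the 1.8 pattern is simply the UTF-8 encoding of U+00A0 written as octal bytes. On Ruby 1.9 or later this can be checked with a sketch like the following (variable names are illustrative):

```ruby
nbsp = "\u00A0"

code_point  = nbsp.ord                           # 160
octal_bytes = nbsp.bytes.map { |b| b.to_s(8) }   # the two UTF-8 bytes, in octal

cleaned = "foo\u00A0bar".gsub(nbsp, '')

p code_point   # 160
p octal_bytes  # ["302", "240"]
p cleaned      # "foobar"
```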

Resources