Regexp using literal vs. escaped character - ruby

Is there any practical difference between a regexp using an escape character versus one using the literal character? I.e. are there any situations where matching with them will return different results?
Example in Ruby:
literal = Regexp.new("\t")
=> / /
escaped = Regexp.new("\\t")
=> /\t/
# They're different...
literal == escaped
=> false
# ...but they seem to match the same:
"Hello\tWorld".match(literal)
=> #<MatchData "\t">
"Hello\tWorld".match(escaped)
=> #<MatchData "\t">

No, not in the case of \t (or \n).
But it won't work in most other cases (e.g., escape sequences that either don't have a 1:1 equivalent in string escapes like \s or where the meaning differs like \b), so it's generally a good idea to use the escaped versions (or construct the regex using /.../ in the first place).

Related

ruby - weird duplication with backtick in gsub [duplicate]

s = "#main= 'quotes'
s.gsub "'", "\\'" # => "#main= quotes'quotes"
This seems to be wrong, I expect to get "#main= \\'quotes\\'"
when I don't use escape char, then it works as expected.
s.gsub "'", "*" # => "#main= *quotes*"
So there must be something to do with escaping.
Using ruby 1.9.2p290
I need to replace single quotes with back-slash and a quote.
Even more inconsistencies:
"\\'".length # => 2
"\\*".length # => 2
# As expected
"'".gsub("'", "\\*").length # => 2
"'a'".gsub("'", "\\*") # => "\\*a\\*" (length==5)
# WTF next:
"'".gsub("'", "\\'").length # => 0
# Doubling the content?
"'a'".gsub("'", "\\'") # => "a'a" (length==3)
What is going on here?
You're getting tripped up by the specialness of \' inside a regular expression replacement string:
\0, \1, \2, ... \9, \&, \`, \', \+
Substitutes the value matched by the nth grouped subexpression, or by the entire match, pre- or postmatch, or the highest group.
So when you say "\\'", the double \\ becomes just a single backslash and the result is \' but that means "The string to the right of the last successful match." If you want to replace single quotes with escaped single quotes, you need to escape more to get past the specialness of \':
s.gsub("'", "\\\\'")
Or avoid the toothpicks and use the block form:
s.gsub("'") { |m| '\\' + m }
You would run into similar issues if you were trying to escape backticks, a plus sign, or even a single digit.
The overall lesson here is to prefer the block form of gsub for anything but the most trivial of substitutions.
s = "#main = 'quotes'
s.gsub "'", "\\\\'"
Since \it's \\equivalent if you want to get a double backslash you have to put four of ones.
You need to escape the \ as well:
s.gsub "'", "\\\\'"
Outputs
"#main= \\'quotes\\'"
A good explanation found on an outside forum:
The key point to understand IMHO is that a backslash is special in
replacement strings. So, whenever one wants to have a literal
backslash in a replacement string one needs to escape it and hence
have [two] backslashes. Coincidentally a backslash is also special in a
string (even in a single quoted string). So you need two levels of
escaping, makes 2 * 2 = 4 backslashes on the screen for one literal
replacement backslash.
source

Match a word with backslash

I am struggling to write a Ruby regexp that will match all words which: starts with 2 or 3 letters, then have backslash (\) and then have 7 or 8 letters and digits. The expression I use is like this:
p "BFD\082BBSA".match %r{\A[a-zA-Z]{2,3}\/[a-zA-Z0-9]{7,8}\z}
But each time this code returns nil. What am I doing wrong?
Try as below :
'BFD\082BBSA'.match %r{\A[a-zA-Z]{2,3}\\[a-zA-Z0-9]{7,8}\z}
# => #<MatchData "BFD\\082BBSA">
#or
"BFD\\082BBSA".match %r{\A[a-zA-Z]{2,3}\\[a-zA-Z0-9]{7,8}\z}
# => #<MatchData "BFD\\082BBSA">
Read this also - Backslashes in Single quoted strings vs. Double quoted strings in Ruby?
The problem is that you actually have no backslash in your string, just a null Unicode character:
"BFD\082BBSA"
# => "BFD\u000082BBSA"
So you just have to escape the backslash in the string:
"BFD\\082BBSA"
# => "BFD\\082BBSA"
Moreover, as others pointed out, \/ will match a forward slash, so you have to change \/ into \\:
"BFD\\082BBSA".match(/\A[a-z]{2,3}\\[a-z0-9]{7,8}\z/i)
# => #<MatchData "BFD\\082BBSA">
You wanted to match the backward slash, but you are matching forward slash. Please change the RegEx to
[a-zA-Z]{2,3}\\[a-zA-Z0-9]{7,8}
Note the \\ instead of \/. Check the RegEx at work, here

Replacing regex capture with the same capture and an extra string

I am trying to escape certain characters in a string. In particular, I want to turn
abc/def.ghi into abc\/def\.ghi
I tried to use the following syntax:
1.9.3p125 :076 > "abc/def.ghi".gsub(/([\/.])/, '\\\1')
=> "abc\\1def\\1ghi"
Hmm. This behaves as if capture replacements didn't work. Yet, when I tried this:
1.9.3p125 :075 > "abc/def.ghi".gsub(/([\/.])/, '\1')
=> "abc/def.ghi"
... I got the replacement to work, but, of course, my prefixes weren't part of it.
What is the correct syntax to do something like this?
This should be easier
gsub(/(?=[.\/])/, "\\")
If you are trying to prepare a string to be used as a regex pattern, use the right tool:
Regexp.escape('abc/def.ghi')
=> "abc/def\\.ghi"
You can then use the resulting string to create a regex:
/#{ Regexp.escape('abc/def.ghi') }/
=> /abc\/def\.ghi/
or:
Regexp.new(Regexp.escape('abc/def.ghi'))
=> /abc\/def\.ghi/
From the docs:
Escapes any characters that would have special meaning in a regular expression. Returns a new escaped string, or self if no characters are escaped. For any string, Regexp.new(Regexp.escape(str))=~str will be true.
Regexp.escape('\*?{}.') #=> \\\*\?\{\}\.
You can pass a block to gsub:
>> "abc/def.ghi".gsub(/([\/.])/) {|m| "\\#{m}"}
=> "abc\\/def\\.ghi"
Not nearly as elegant as #sawa's answer, but it was the only way I could find to get it to work if you need the replacing string to contain the captured group/backreference (rather than inserting the replacement before the look-ahead).

Create Regex from String stored in variable with escaped characters (Ruby)

I'm trying to build a regex from a string object, which happens to be stored in a variable.
The problem I'm facing is that escaped sequences (in the string) such "\d" doesn't make to the resulting regex.
Regexp.new("\d") => /d/
If I use single quotes, tough, it works flawless.
Regexp.new('\d') => /\d/
But, as my string is stored in a variable, I always get the double-quoted string.
Is there a way to turn a double-quoted string to single-quoted string, so that I could use in the Regexp constructor ?
(I'd like to use the string interpolation feature of the double quotes)
ex.:
email_pattern = "/[a-z]*\.com"
whole_pattern = "to: #{email_pattern}"
Regexp.new(whole_pattern)
For better readability, I'd like to avoid escaping escape characters.
"\\d"
The problem is, that you end up with completely different strings, depending on whether you use single or double quotes:
"\d".chars.to_a
#=> ["d"]
'\d'.chars.to_a
#=> ["\\", "d"]
so when you are using double quotes, the single \ is immediately lost and cannot be recovered by definition, for example:
"\d" == "d"
#=> true
so you can never know what the string contained before the escaping took place. As #FrankSchmitt suggested, use the double backslash or stick with single quotes. There's no other way.
There's an option, though. You can define your regex parts as regexes themselves, instead of strings. They behave exactly as expected:
regex1 = /\d/
#=> /\d/
regex2 = /foobar/
#=> /foobar/
Then, you can build your final regex with #{}-style interpolation, instead of building the regex source from strings:
regex3 = /#{regex1} #{regex2}/
#=> /(?-mix:\d) (?-mix:foobar)/
Reflecting your example this would translate to:
email_regex = /[a-z]*\.com/
whole_regex = /to: #{email_regex}/
#=> /to: (?-mix:[a-z]*\.com)/
You may also find Regexp#escape interesting. (see the docs)
If you run into further escaping problems (with the slashes), you can also use the alternative Regexp literal syntax with %r{<your regex here>}, in which you do not need to escape the / character. For example:
%r{/}
#=> /\//
There's no getting around escaping the backslash \ with \\, though.
Either create your string with single quotes:
s = '\d'
r = Regexp.new(s)
or quote the backslash:
s = "\\d"
r = Regexp.new(s)
Both should work.

How to use double brackets in a regular expression?

What do double square brackets mean in a regex? I am confused about the following examples:
/[[^abc]]/
/[^abc]/
I was testing using Rubular, but I didn't see any difference between the one with double brackets and single brackets.
Posix character classes use a [:alpha:] notation, which are used inside a regular expression like:
/[[:alpha:][:digit:]]/
You'll need to scroll down a ways to get to the Posix information in the link above. From the docs:
POSIX bracket expressions are also similar to character classes. They provide a portable alternative to the above, with the added benefit that they encompass non-ASCII characters. For instance, /\d/ matches only the ASCII decimal digits (0-9); whereas /[[:digit:]]/ matches any character in the Unicode Nd category.
/[[:alnum:]]/ - Alphabetic and numeric character
/[[:alpha:]]/ - Alphabetic character
/[[:blank:]]/ - Space or tab
/[[:cntrl:]]/ - Control character
/[[:digit:]]/ - Digit
/[[:graph:]]/ - Non-blank character (excludes spaces, control characters, and similar)
/[[:lower:]]/ - Lowercase alphabetical character
/[[:print:]]/ - Like [:graph:], but includes the space character
/[[:punct:]]/ - Punctuation character
/[[:space:]]/ - Whitespace character ([:blank:], newline,
carriage return, etc.)
/[[:upper:]]/ - Uppercase alphabetical
/[[:xdigit:]]/ - Digit allowed in a hexadecimal number (i.e., 0-9a-fA-F)
Ruby also supports the following non-POSIX character classes:
/[[:word:]]/ - A character in one of the following Unicode general categories Letter, Mark, Number, Connector_Punctuation
/[[:ascii:]]/ - A character in the ASCII character set
# U+06F2 is "EXTENDED ARABIC-INDIC DIGIT TWO"
/[[:digit:]]/.match("\u06F2") #=> #<MatchData "\u{06F2}">
/[[:upper:]][[:lower:]]/.match("Hello") #=> #<MatchData "He">
/[[:xdigit:]][[:xdigit:]]/.match("A6") #=> #<MatchData "A6">
'[[' doesn't have any special meaning. [xyz] is a character class and will match a single x, y or z. The carat ^ takes all characters not in the brackets.
Removing ^ for simplicity, you can see that the first open bracket is being matched with the first close bracket and the second closed bracket is being used as part of the character class. The final close bracket is treated as another character to be matched.
irb(main):032:0> /[[abc]]/ =~ "[a]"
=> 1
irb(main):033:0> /[[abc]]/ =~ "a]"
=> 0
This appears to have the same result as your original in some cases
irb(main):034:0> /[abc]/ =~ "a]"
=> 0
irb(main):034:0> /[abc]/ =~ "a"
=> 0
But this is only because your regular expression is not looking for an exact match.
irb(main):036:0> /^[abc]$/ =~ "a]"
=> nil

Resources