Backslashes in gsub (escaping and backreferencing) - ruby

Consider the following snippet:
puts 'hello'.gsub(/.+/, '\0 \\0 \\\0 \\\\0')
This prints (as seen on ideone.com):
hello hello \0 \0
This was very surprising, because I'd expect to see something like this instead:
hello \0 \hello \\0
My argument is that \ is an escape character, so you write \\ to get a literal backslash, thus \\0 is a literal backslash \ followed by 0, etc. Obviously this is not how gsub is interpreting it, so can someone explain what's going on?
And what do I have to do to get the replacement I want above?

Escaping is limited when using single quotes rather than double quotes:
puts 'single\nquote'
puts "double\nquote"
"\0" is the null character (used e.g. in C to mark the end of a string), whereas '\0' is "\\0". Therefore both 'hello'.gsub(/.+/, '\0') and 'hello'.gsub(/.+/, "\\0") return "hello", but 'hello'.gsub(/.+/, "\0") returns "\000". 'hello'.gsub(/.+/, '\\0') also returns 'hello', because inside single quotes a backslash is kept literally unless it precedes another backslash or a single quote. In fact, this has nothing to do with gsub: '\0' == "\\0" and '\\0' == "\\0". Following this logic, this is how ruby sees the other strings: both '\\\0' and '\\\\0' equal "\\\\0", which (when printed) gives you \\0. As gsub uses \x for inserting match number x, you need a way to escape \x, which is \\x, or in its string representation: "\\\\x".
Therefore the line
puts 'hello'.gsub(/.+/, "\\0 \\\\0 \\\\\\0 \\\\\\\\0")
indeed results in
hello \0 \hello \\0
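To make the two passes visible, the sketch below first prints what Ruby's parser makes of each literal and then what gsub makes of the resulting string. The variable names are illustrative only.

```ruby
# Pass 1: the parser turns each literal into an actual string.
single = '\0 \\0 \\\0 \\\\0'
double = "\\0 \\\\0 \\\\\\0 \\\\\\\\0"
puts single   # \0 \0 \\0 \\0
puts double   # \0 \\0 \\\0 \\\\0

# Pass 2: gsub scans that string for \0 (whole match) and \\ (one backslash).
from_single = 'hello'.gsub(/.+/, single)
from_double = 'hello'.gsub(/.+/, double)
puts from_single   # hello hello \0 \0
puts from_double   # hello \0 \hello \\0
```

Printing the replacement string on its own before handing it to gsub is usually the quickest way to find out which of the two passes ate a backslash.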

Related

How to understand ruby gsub "\\\\" return "\" [duplicate]

I don't understand this Ruby code:
>> puts '\\ <- single backslash'
# \ <- single backslash
>> puts '\\ <- 2x a, because 2 backslashes get replaced'.sub(/\\/, 'aa')
# aa <- 2x a, because two backslashes get replaced
so far, all as expected. but if we search for 1 with /\\/, and replace with 2, encoded by '\\\\', why do we get this:
>> puts '\\ <- only 1 ... replace 1 with 2'.sub(/\\/, '\\\\')
# \ <- only 1 backslash, even though we replace 1 with 2
and then, when we encode 3 with '\\\\\\', we only get 2:
>> puts '\\ <- only 2 ... 1 with 3'.sub(/\\/, '\\\\\\')
# \\ <- 2 backslashes, even though we replace 1 with 3
anyone able to understand why a backslash gets swallowed in the replacement string? this happens on 1.8 and 1.9.
Quick Answer
If you want to sidestep all this confusion, use the much less confusing block syntax. Here is an example that replaces each backslash with 2 backslashes:
"some\\path".gsub('\\') { '\\\\' }
Gruesome Details
The problem is that when using sub (and gsub) without a block, Ruby interprets special character sequences in the replacement parameter. Unfortunately, sub uses the backslash as the escape character for these:
\& (the entire match)
\+ (the last group)
\` (pre-match string)
\' (post-match string)
\0 (same as \&)
\1 (first captured group)
\2 (second captured group)
\\ (a backslash)
Like any escaping, this creates an obvious problem. If you want to include the literal value of one of the above sequences (e.g. \1) in the output string, you have to escape it. So, to get Hello \1, you need the replacement string to be Hello \\1. And to represent this as a string literal in Ruby, you have to escape those backslashes again, like this: "Hello \\\\1"
So, there are two different escaping passes. The first one takes the string literal and creates the internal string value. The second takes that internal string value and replaces the sequences above with the matching data.
If a backslash is not followed by a character that matches one of the above sequences, then the backslash (and the character that follows) will pass through unaltered. This also affects a backslash at the end of the string: it will pass through unaltered. It's easiest to see this logic in the Rubinius code; just look for the to_sub_replacement method in the String class.
Here are some examples of how String#sub is parsing the replacement string:
1 backslash \ (which has a string literal of "\\")
Passes through unaltered because the backslash is at the end of the string and has no characters after it.
Result: \
2 backslashes \\ (which have a string literal of "\\\\")
The pair of backslashes matches the escaped backslash sequence (see \\ above) and gets converted into a single backslash.
Result: \
3 backslashes \\\ (which have a string literal of "\\\\\\")
The first two backslashes match the \\ sequence and get converted to a single backslash. Then the final backslash is at the end of the string so it passes through unaltered.
Result: \\
4 backslashes \\\\ (which have a string literal of "\\\\\\\\")
Two pairs of backslashes each match the \\ sequence and get converted to a single backslash.
Result: \\
2 backslashes with character in the middle \a\ (which have a string literal of "\\a\\")
The \a does not match any of the escape sequences so it is allowed to pass through unaltered. The trailing backslash is also allowed through.
Result: \a\
Note: The same result could be obtained from: \\a\\ (with the literal string: "\\\\a\\\\")
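The cases above can be checked directly. This is a sketch that replaces a placeholder character 'x' with each replacement string; the table and the placeholder are assumptions made for the example.

```ruby
# Replacement string (as a double-quoted literal) => expected output of sub.
cases = {
  "\\"        => "\\",      # 1 backslash: trailing, passes through
  "\\\\"      => "\\",      # 2 backslashes: one escaped pair -> 1
  "\\\\\\"    => "\\\\",    # 3 backslashes: pair -> 1, plus trailing one
  "\\\\\\\\"  => "\\\\",    # 4 backslashes: two pairs -> 2
  "\\a\\"     => "\\a\\",   # \a\ passes through unaltered
  "\\\\a\\\\" => "\\a\\"    # \\a\\ collapses to \a\
}

cases.each do |replacement, expected|
  actual = 'x'.sub('x', replacement)
  status = actual == expected ? 'ok' : 'MISMATCH'
  puts "#{replacement}  ->  #{actual}  (#{status})"
end
```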
In hindsight, this could have been less confusing if String#sub had used a different escape character. Then there wouldn't be the need to double escape all the backslashes.
This is an issue because backslash (\) serves as an escape character for Regexps and Strings. You could use the special variable \& to reduce the number of backslashes in the gsub replacement string.
foo.gsub(/\\/,'\&\&\&') #for some string foo replace each \ with \\\
EDIT: I should mention that the value of \& is from a Regexp match, in this case a single backslash.
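A quick sketch of the \& approach on an assumed sample string (the name foo follows the answer; its value is made up):

```ruby
foo = 'a\b'                          # one backslash between a and b
tripled = foo.gsub(/\\/, '\&\&\&')   # each \& inserts the matched backslash
puts tripled                         # a\\\b
```

Because \& copies whatever was matched, the replacement literal never has to contain a doubled backslash.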
Also, I thought that there was a special way to create a string that disabled the escape character, but none of the forms I tried does. None of these will produce two backslashes:
puts "\\"
puts '\\'
puts %q{\\}
puts %Q{\\}
puts """\\"""
puts '''\\'''
puts <<EOF
\\
EOF
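One literal form that does leave backslashes alone is a heredoc with a single-quoted terminator, which performs no escape processing at all. A minimal sketch (the RAW terminator and the variable name are arbitrary):

```ruby
# Single-quoted heredoc: the body is taken verbatim.
raw = <<'RAW'.chomp
\\
RAW
puts raw          # \\
puts raw.length   # 2
```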
argh, right after I typed all this out, I realised that \ is used to refer to groups in the replacement string. I guess this means that you need a literal \\ in the replacement string to get one replaced \. To get a literal \\ you need four \s, so to replace one with two you actually need eight(!).
# Double every occurrence of \. There's eight backslashes on the right there!
>> puts '\\'.sub(/\\/, '\\\\\\\\')
anything I'm missing? any more efficient ways?
Clearing up a little confusion on the author's second line of code.
You said:
>> puts '\\ <- 2x a, because 2 backslashes get replaced'.sub(/\\/, 'aa')
# aa <- 2x a, because two backslashes get replaced
2 backslashes aren't getting replaced here. You're replacing 1 escaped backslash with two a's ('aa'). That is, if you used .sub(/\\/, 'a'), you would only see one 'a'
'\\'.sub(/\\/, 'anything') #=> anything
the pickaxe book mentions this exact problem, actually. here's another alternative (from page 130 of the latest edition)
str = 'a\b\c' # => "a\b\c"
str.gsub(/\\/) { '\\\\' } # => "a\\b\\c"
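A small check, on an assumed sample string, that the block form and the string form (which needs eight backslashes in the literal) give the same result. The variable names are made up for the illustration.

```ruby
path = 'some\path'                       # contains one backslash

# Block form: the block's return value is inserted as-is.
with_block = path.gsub('\\') { '\\\\' }

# String form: the literal yields four backslashes, which gsub unescapes to two.
with_string = path.gsub('\\', '\\\\\\\\')

puts with_block                          # some\\path
puts with_block == with_string           # true
```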

Escape status within a string literal as argument of `String#tr`

There is something mysterious to me about the escape status of a backslash within a single-quoted string literal passed as an argument to String#tr. Can you explain the contrast between the three examples below? I particularly do not understand the second one. To avoid complication, I am using 'd' here, which does not change its meaning when escaped in double quotes ("\d" = "d").
'\\'.tr('\\', 'x') #=> "x"
'\\'.tr('\\d', 'x') #=> "\\"
'\\'.tr('\\\d', 'x') #=> "x"
Escaping in tr
The first argument of tr works much like bracket character grouping in regular expressions. You can use ^ at the start of the expression to negate the matching (replace anything that doesn't match) and use e.g. a-f to match a range of characters. Since it has these special characters, it also does escaping internally, so you can use - and ^ as literal characters.
print 'abcdef'.tr('b-e', 'x') # axxxxf
print 'abcdef'.tr('b\-e', 'x') # axcdxf
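Two further cases, sketched under the same rules: a leading ^ negates the set, and a backslash makes a hyphen literal. The input strings are made up for illustration.

```ruby
negated = 'abcdef'.tr('^b-e', 'x')   # everything outside b..e becomes x
literal = 'a-b-c'.tr('\-', '_')      # \- is a plain hyphen, not a range
puts negated   # xbcdex
puts literal   # a_b_c
```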
Escaping in Ruby single quote strings
Furthermore, when using single quotes, Ruby tries to include the backslash when possible, i.e. when it's not used to actually escape another backslash or a single quote.
# Single quotes
print '\\' # \
print '\d' # \d
print '\\d' # \d
print '\\\d' # \\d
# Double quotes
print "\\" # \
print "\d" # d
print "\\d" # \d
print "\\\d" # \d
The examples revisited
With all that in mind, let's look at the examples again.
'\\'.tr('\\', 'x') #=> "x"
The string defined as '\\' becomes the literal string \ because the first backslash escapes the second. No surprises there.
'\\'.tr('\\d', 'x') #=> "\\"
The string defined as '\\d' becomes the literal string \d. The tr engine, in turn, uses the backslash in the literal string to escape the d. Result: tr replaces instances of d with x, and the backslash in the input is left untouched.
'\\'.tr('\\\d', 'x') #=> "x"
The string defined as '\\\d' becomes the literal \\d. First \\ becomes \. Then \d becomes \d, i.e. the backslash is preserved. (This particular behavior is different from double-quoted strings, where the backslash would be eaten alive, leaving only a lonesome d.)
The literal string \\d then makes tr replace all characters that are either a backslash or a d with the replacement string.
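To see the difference between the three arguments, it helps to print how many characters each literal really contains before tr looks at it. A sketch (the variable names are arbitrary):

```ruby
args = ['\\', '\\d', '\\\d']
lengths = args.map(&:length)                      # [1, 2, 3]
on_backslash = args.map { |a| '\\'.tr(a, 'x') }   # ["x", "\\", "x"]
on_d = args.map { |a| 'd'.tr(a, 'x') }            # ["d", "x", "x"]

args.each_with_index do |a, i|
  puts "#{a} (#{lengths[i]} chars): \\ -> #{on_backslash[i]}, d -> #{on_d[i]}"
end
```

Only the two-character argument \d leaves the backslash alone, because there the backslash is consumed as tr's own escape for d.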

How to add a single backslash character to a string in Ruby?

I want to insert a backslash before the apostrophe in the string "children's world". Is there an easy way to do it?
irb(main):035:0> s = "children's world"
=> "children's world"
irb(main):036:0> s.gsub('\'', '\\\'')
=> "childrens worlds world"
Answer
You need some extra backslashes:
>> puts "children's world".gsub("'", '\\\\\'')
children\'s world
or slightly more concisely (since you don't need to escape the ' in a double-quoted string):
>> puts "children's world".gsub("'", "\\\\'")
children\'s world
or even more concisely:
>> puts "children's world".gsub("'") { "\\'" }
children\'s world
Explanation
Your '\\\'' generates \' as a string:
>> puts '\\\''
\'
and \' is a special replacement pattern in Ruby. From ruby-doc.org:
you may refer to some special match variables using these combinations ... \' corresponds to $', which contains string after match
So the \' that gsub sees in the second argument is being interpreted as a special pattern (everything in the original string after the match) instead of as a literal \'.
So what you want gsub to see is actually \\', which can be produced by '\\\\\'' or "\\\\'".
Or, if you use the block form of gsub (gsub("xxx") { "yyy" }) then Ruby takes the replacement string "yyy" literally without trying to apply replacement patterns.
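Putting the explanation side by side, as a sketch: the first call reproduces the surprising result from the question, the other two give the intended one. The variable names are made up for the example.

```ruby
s = "children's world"

broken       = s.gsub("'", '\\\'')      # gsub sees \'  (post-match string)
fixed_string = s.gsub("'", "\\\\'")     # gsub sees \\' (backslash, then ')
fixed_block  = s.gsub("'") { "\\'" }    # block result is used literally

puts broken         # childrens worlds world
puts fixed_string   # children\'s world
puts fixed_block    # children\'s world
```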
Note: If you have to create a replacement string with a lot of \s you could take advantage of the fact that when you use /.../ (or %r{...}) you don't have to double-escape the backslashes:
>> puts "children's world".gsub("'", /\\'/.source)
children\'s world
Or you could use a single-quoted heredoc: (using <<'STR' instead of just <<STR)
>> puts "children's world".gsub("'", <<'STR'.strip)
\\'
STR
children\'s world
>> puts s.gsub("'", "\\\\'")
children\'s world
Your problem is that the string "\'" is meaningful to gsub in a replacement string. To make it work the way you want, you have to either escape the backslash once more, as above, or use the block form:
s.gsub("'") {"\\'"}
