Gsub causing part of string to be substituted - ruby

I want to replace all occurrences of a single quote (') with backslash single quote (\'). I tried doing this with gsub, but I'm getting partial string duplication:
a = "abc 'def' ghi"
a.gsub("'", "\\'")
# => "abc def' ghidef ghi ghi"
Can someone explain why this happens and what a solution to this is?

It happens because "\\'" has a special meaning when it occurs as the replacement argument of gsub, namely it means the post-match substring.
To do what you want, you can use a block:
a.gsub("'"){"\\'"}
# => "abc \\'def\\' ghi"
Notice that the backslash is escaped in the string inspection, so it appears as \\.

Your "\\'" actually represents a literal \' because of the backslash escaping the next backslash. And that literal \' in Ruby regex is actually a special variable that interpolates to the part of the string that follows the matched portion. So here's what's happening.
abc 'def' ghi
^
The caret points to the first match, '. Replace it with everything to its right, i.e. def' ghi.
abc def' ghidef' ghi
++++++++
Now find the next match:
abc def' ghidef' ghi
^
Once again, replace the ' with everything to its right, i.e. ghi.
abc def' ghidef ghi ghi
++++

It's possible you just need a higher dose of escaping:
a.gsub(/'/, "\\\\'" )
Result:
abc \'def\' ghi

Related

How to understand ruby gsub "\\\\" return "\" [duplicate]

I don't understand this Ruby code:
>> puts '\\ <- single backslash'
# \ <- single backslash
>> puts '\\ <- 2x a, because 2 backslashes get replaced'.sub(/\\/, 'aa')
# aa <- 2x a, because two backslashes get replaced
so far, all as expected. but if we search for 1 with /\\/, and replace with 2, encoded by '\\\\', why do we get this:
>> puts '\\ <- only 1 ... replace 1 with 2'.sub(/\\/, '\\\\')
# \ <- only 1 backslash, even though we replace 1 with 2
and then, when we encode 3 with '\\\\\\', we only get 2:
>> puts '\\ <- only 2 ... 1 with 3'.sub(/\\/, '\\\\\\')
# \\ <- 2 backslashes, even though we replace 1 with 3
anyone able to understand why a backslash gets swallowed in the replacement string? this happens on 1.8 and 1.9.
Quick Answer
If you want to sidestep all this confusion, use the much less confusing block syntax. Here is an example that replaces each backslash with 2 backslashes:
"some\\path".gsub('\\') { '\\\\' }
Gruesome Details
The problem is that when using sub (and gsub), without a block, ruby interprets special character sequences in the replacement parameter. Unfortunately, sub uses the backslash as the escape character for these:
\& (the entire regex)
\+ (the last group)
\` (pre-match string)
\' (post-match string)
\0 (same as \&)
\1 (first captured group)
\2 (second captured group)
\\ (a backslash)
Like any escaping, this creates an obvious problem. If you want include the literal value of one of the above sequences (e.g. \1) in the output string you have to escape it. So, to get Hello \1, you need the replacement string to be Hello \\1. And to represent this as a string literal in Ruby, you have to escape those backslashes again like this: "Hello \\\\1"
So, there are two different escaping passes. The first one takes the string literal and creates the internal string value. The second takes that internal string value and replaces the sequences above with the matching data.
If a backslash is not followed by a character that matches one of the above sequences, then the backslash (and character that follows) will pass through unaltered. This is also affects a backslash at the end of the string -- it will pass through unaltered. It's easiest to see this logic in the rubinius code; just look for the to_sub_replacement method in the String class.
Here are some examples of how String#sub is parsing the replacement string:
1 backslash \ (which has a string literal of "\\")
Passes through unaltered because the backslash is at the end of the string and has no characters after it.
Result: \
2 backslashes \\ (which have a string literal of "\\\\")
The pair of backslashes match the escaped backslash sequence (see \\ above) and gets converted into a single backslash.
Result: \
3 backslashes \\\ (which have a string literal of "\\\\\\")
The first two backslashes match the \\ sequence and get converted to a single backslash. Then the final backslash is at the end of the string so it passes through unaltered.
Result: \\
4 backslashes \\\\ (which have a string literal of "\\\\\\\\")
Two pairs of backslashes each match the \\ sequence and get converted to a single backslash.
Result: \\
2 backslashes with character in the middle \a\ (which have a string literal of "\\a\\")
The \a does not match any of the escape sequences so it is allowed to pass through unaltered. The trailing backslash is also allowed through.
Result: \a\
Note: The same result could be obtained from: \\a\\ (with the literal string: "\\\\a\\\\")
In hindsight, this could have been less confusing if String#sub had used a different escape character. Then there wouldn't be the need to double escape all the backslashes.
This is an issue because backslash (\) serves as an escape character for Regexps and Strings. You could do use the special variable \& to reduce the number backslashes in the gsub replacement string.
foo.gsub(/\\/,'\&\&\&') #for some string foo replace each \ with \\\
EDIT: I should mention that the value of \& is from a Regexp match, in this case a single backslash.
Also, I thought that there was a special way to create a string that disabled the escape character, but apparently not. None of these will produce two slashes:
puts "\\"
puts '\\'
puts %q{\\}
puts %Q{\\}
puts """\\"""
puts '''\\'''
puts <<EOF
\\
EOF
argh, right after I typed all this out, I realised that \ is used to refer to groups in the replacement string. I guess this means that you need a literal \\ in the replacement string to get one replaced \. To get a literal \\ you need four \s, so to replace one with two you actually need eight(!).
# Double every occurrence of \. There's eight backslashes on the right there!
>> puts '\\'.sub(/\\/, '\\\\\\\\')
anything I'm missing? any more efficient ways?
Clearing up a little confusion on the author's second line of code.
You said:
>> puts '\\ <- 2x a, because 2 backslashes get replaced'.sub(/\\/, 'aa')
# aa <- 2x a, because two backslashes get replaced
2 backslashes aren't getting replaced here. You're replacing 1 escaped backslash with two a's ('aa'). That is, if you used .sub(/\\/, 'a'), you would only see one 'a'
'\\'.sub(/\\/, 'anything') #=> anything
the pickaxe book mentions this exact problem, actually. here's another alternative (from page 130 of the latest edition)
str = 'a\b\c' # => "a\b\c"
str.gsub(/\\/) { '\\\\' } # => "a\\b\\c"

gsub with backslashes not reversible [duplicate]

I don't understand this Ruby code:
>> puts '\\ <- single backslash'
# \ <- single backslash
>> puts '\\ <- 2x a, because 2 backslashes get replaced'.sub(/\\/, 'aa')
# aa <- 2x a, because two backslashes get replaced
so far, all as expected. but if we search for 1 with /\\/, and replace with 2, encoded by '\\\\', why do we get this:
>> puts '\\ <- only 1 ... replace 1 with 2'.sub(/\\/, '\\\\')
# \ <- only 1 backslash, even though we replace 1 with 2
and then, when we encode 3 with '\\\\\\', we only get 2:
>> puts '\\ <- only 2 ... 1 with 3'.sub(/\\/, '\\\\\\')
# \\ <- 2 backslashes, even though we replace 1 with 3
anyone able to understand why a backslash gets swallowed in the replacement string? this happens on 1.8 and 1.9.
Quick Answer
If you want to sidestep all this confusion, use the much less confusing block syntax. Here is an example that replaces each backslash with 2 backslashes:
"some\\path".gsub('\\') { '\\\\' }
Gruesome Details
The problem is that when using sub (and gsub), without a block, ruby interprets special character sequences in the replacement parameter. Unfortunately, sub uses the backslash as the escape character for these:
\& (the entire regex)
\+ (the last group)
\` (pre-match string)
\' (post-match string)
\0 (same as \&)
\1 (first captured group)
\2 (second captured group)
\\ (a backslash)
Like any escaping, this creates an obvious problem. If you want include the literal value of one of the above sequences (e.g. \1) in the output string you have to escape it. So, to get Hello \1, you need the replacement string to be Hello \\1. And to represent this as a string literal in Ruby, you have to escape those backslashes again like this: "Hello \\\\1"
So, there are two different escaping passes. The first one takes the string literal and creates the internal string value. The second takes that internal string value and replaces the sequences above with the matching data.
If a backslash is not followed by a character that matches one of the above sequences, then the backslash (and character that follows) will pass through unaltered. This is also affects a backslash at the end of the string -- it will pass through unaltered. It's easiest to see this logic in the rubinius code; just look for the to_sub_replacement method in the String class.
Here are some examples of how String#sub is parsing the replacement string:
1 backslash \ (which has a string literal of "\\")
Passes through unaltered because the backslash is at the end of the string and has no characters after it.
Result: \
2 backslashes \\ (which have a string literal of "\\\\")
The pair of backslashes match the escaped backslash sequence (see \\ above) and gets converted into a single backslash.
Result: \
3 backslashes \\\ (which have a string literal of "\\\\\\")
The first two backslashes match the \\ sequence and get converted to a single backslash. Then the final backslash is at the end of the string so it passes through unaltered.
Result: \\
4 backslashes \\\\ (which have a string literal of "\\\\\\\\")
Two pairs of backslashes each match the \\ sequence and get converted to a single backslash.
Result: \\
2 backslashes with character in the middle \a\ (which have a string literal of "\\a\\")
The \a does not match any of the escape sequences so it is allowed to pass through unaltered. The trailing backslash is also allowed through.
Result: \a\
Note: The same result could be obtained from: \\a\\ (with the literal string: "\\\\a\\\\")
In hindsight, this could have been less confusing if String#sub had used a different escape character. Then there wouldn't be the need to double escape all the backslashes.
This is an issue because backslash (\) serves as an escape character for Regexps and Strings. You could do use the special variable \& to reduce the number backslashes in the gsub replacement string.
foo.gsub(/\\/,'\&\&\&') #for some string foo replace each \ with \\\
EDIT: I should mention that the value of \& is from a Regexp match, in this case a single backslash.
Also, I thought that there was a special way to create a string that disabled the escape character, but apparently not. None of these will produce two slashes:
puts "\\"
puts '\\'
puts %q{\\}
puts %Q{\\}
puts """\\"""
puts '''\\'''
puts <<EOF
\\
EOF
argh, right after I typed all this out, I realised that \ is used to refer to groups in the replacement string. I guess this means that you need a literal \\ in the replacement string to get one replaced \. To get a literal \\ you need four \s, so to replace one with two you actually need eight(!).
# Double every occurrence of \. There's eight backslashes on the right there!
>> puts '\\'.sub(/\\/, '\\\\\\\\')
anything I'm missing? any more efficient ways?
Clearing up a little confusion on the author's second line of code.
You said:
>> puts '\\ <- 2x a, because 2 backslashes get replaced'.sub(/\\/, 'aa')
# aa <- 2x a, because two backslashes get replaced
2 backslashes aren't getting replaced here. You're replacing 1 escaped backslash with two a's ('aa'). That is, if you used .sub(/\\/, 'a'), you would only see one 'a'
'\\'.sub(/\\/, 'anything') #=> anything
the pickaxe book mentions this exact problem, actually. here's another alternative (from page 130 of the latest edition)
str = 'a\b\c' # => "a\b\c"
str.gsub(/\\/) { '\\\\' } # => "a\\b\\c"

How to use sed command to add a string before a pattern string?

I want to use sed to modify my file named "baz".
When i search a pattern foo , foo is not at the beginning or end of line, i want to append bar before foo, how can i do it using sed?
Input file named baz:
blah_foo_blahblahblah
blah_foo_blahblahblah
blah_foo_blahblahblah
blah_foo_blahblahblah
Output file
blah_barfoo_blahblahblah
blah_barfoo_blahblahblah
blah_barfoo_blahblahblah
blah_barfoo_blahblahblah
You can just use something like:
sed 's/foo/barfoo/g' baz
(the g at the end means global, every occurrence on each line rather than just the first).
For an arbitrary (rather than fixed) pattern such as foo[0-9], you could use capture groups as follows:
pax$ echo 'xyz fooA abc
xyz foo5 abc
xyz fooB abc' | sed 's/\(foo[0-9]\)/bar\1/g'
xyz fooA abc
xyz barfoo5 abc
xyz fooB abc
The parentheses capture the actual text that matched the pattern and the \1 uses it in the substitution.
You can use arbitrarily complex patterns with this one, including ensuring you match only complete words. For example, only changing the pattern if it's immediately surrounded by a word boundary:
pax$ echo 'xyz fooA abc
xyz foo5 abc foo77 qqq xfoo4 zzz
xyz fooB abc' | sed 's/\(\bfoo[0-9]\b\)/bar\1/g'
xyz fooA abc
xyz barfoo5 abc foo77 qqq xfoo4 zzz
xyz fooB abc
In terms of how the capture groups work, you can use parentheses to store the text that matches a pattern for later use in the replacement. The captured identifiers are based on the ( characters reading from left to right, so the regex (I've left off the \ escape characters and padded it a bit for clarity):
( ( \S* ) ( \S* ) )
^ ^ ^ ^ ^ ^
| | | | | |
| +--2--+ +--3--+ |
+---------1---------+
when applied to the text Pax Diablo would give you three groups:
\1 = Pax Diablo
\2 = Pax
\3 = Diablo
as shown below:
pax$ echo 'Pax Diablo' | sed 's/\(\(\S*\) \(\S*\)\)/[\1] [\2] [\3]/'
[Pax Diablo] [Pax] [Diablo]
Just substitute the start of the line with something different.
sed '/^foo/s/^/bar/'
To replace or modify all "foo" except at beginning or end of line, I would suggest to temporarily replace them at beginning and end of line with a unique sentinel value.
sed 's/^foo/____veryunlikelytoken_bol____/
s/foo$/____veryunlikelytoken_eol____/
s/foo/bar&/g
s/^____veryunlikelytoken_bol____/foo/
s/____veryunlikelytoken_eol____$/foo/'
In sed there is no way to specify "cannot match here". In Perl regex and derivatives (meaning languages which borrowed from Perl's regex, not necessarily languages derived from Perl) you have various negative assertions so you can do something like
perl -pe 's/(?!^)foo(?!$)/barfoo/g'

How do I use string.tr to replace double quotes with single ones in Ruby?

How do I use string.tr to replace double quotes with single ones in Ruby?
'abc "def" ghi'.tr('"', "'") # => abc 'def' ghi
Besides tr, you can also use gsub
irb(main):001:0> 'abc "def" ghi'.gsub(/"/,"'")
=> "abc 'def' ghi"

Weird backslash substitution in Ruby

I don't understand this Ruby code:
>> puts '\\ <- single backslash'
# \ <- single backslash
>> puts '\\ <- 2x a, because 2 backslashes get replaced'.sub(/\\/, 'aa')
# aa <- 2x a, because two backslashes get replaced
so far, all as expected. but if we search for 1 with /\\/, and replace with 2, encoded by '\\\\', why do we get this:
>> puts '\\ <- only 1 ... replace 1 with 2'.sub(/\\/, '\\\\')
# \ <- only 1 backslash, even though we replace 1 with 2
and then, when we encode 3 with '\\\\\\', we only get 2:
>> puts '\\ <- only 2 ... 1 with 3'.sub(/\\/, '\\\\\\')
# \\ <- 2 backslashes, even though we replace 1 with 3
anyone able to understand why a backslash gets swallowed in the replacement string? this happens on 1.8 and 1.9.
Quick Answer
If you want to sidestep all this confusion, use the much less confusing block syntax. Here is an example that replaces each backslash with 2 backslashes:
"some\\path".gsub('\\') { '\\\\' }
Gruesome Details
The problem is that when using sub (and gsub), without a block, ruby interprets special character sequences in the replacement parameter. Unfortunately, sub uses the backslash as the escape character for these:
\& (the entire regex)
\+ (the last group)
\` (pre-match string)
\' (post-match string)
\0 (same as \&)
\1 (first captured group)
\2 (second captured group)
\\ (a backslash)
Like any escaping, this creates an obvious problem. If you want include the literal value of one of the above sequences (e.g. \1) in the output string you have to escape it. So, to get Hello \1, you need the replacement string to be Hello \\1. And to represent this as a string literal in Ruby, you have to escape those backslashes again like this: "Hello \\\\1"
So, there are two different escaping passes. The first one takes the string literal and creates the internal string value. The second takes that internal string value and replaces the sequences above with the matching data.
If a backslash is not followed by a character that matches one of the above sequences, then the backslash (and character that follows) will pass through unaltered. This is also affects a backslash at the end of the string -- it will pass through unaltered. It's easiest to see this logic in the rubinius code; just look for the to_sub_replacement method in the String class.
Here are some examples of how String#sub is parsing the replacement string:
1 backslash \ (which has a string literal of "\\")
Passes through unaltered because the backslash is at the end of the string and has no characters after it.
Result: \
2 backslashes \\ (which have a string literal of "\\\\")
The pair of backslashes match the escaped backslash sequence (see \\ above) and gets converted into a single backslash.
Result: \
3 backslashes \\\ (which have a string literal of "\\\\\\")
The first two backslashes match the \\ sequence and get converted to a single backslash. Then the final backslash is at the end of the string so it passes through unaltered.
Result: \\
4 backslashes \\\\ (which have a string literal of "\\\\\\\\")
Two pairs of backslashes each match the \\ sequence and get converted to a single backslash.
Result: \\
2 backslashes with character in the middle \a\ (which have a string literal of "\\a\\")
The \a does not match any of the escape sequences so it is allowed to pass through unaltered. The trailing backslash is also allowed through.
Result: \a\
Note: The same result could be obtained from: \\a\\ (with the literal string: "\\\\a\\\\")
In hindsight, this could have been less confusing if String#sub had used a different escape character. Then there wouldn't be the need to double escape all the backslashes.
This is an issue because backslash (\) serves as an escape character for Regexps and Strings. You could do use the special variable \& to reduce the number backslashes in the gsub replacement string.
foo.gsub(/\\/,'\&\&\&') #for some string foo replace each \ with \\\
EDIT: I should mention that the value of \& is from a Regexp match, in this case a single backslash.
Also, I thought that there was a special way to create a string that disabled the escape character, but apparently not. None of these will produce two slashes:
puts "\\"
puts '\\'
puts %q{\\}
puts %Q{\\}
puts """\\"""
puts '''\\'''
puts <<EOF
\\
EOF
argh, right after I typed all this out, I realised that \ is used to refer to groups in the replacement string. I guess this means that you need a literal \\ in the replacement string to get one replaced \. To get a literal \\ you need four \s, so to replace one with two you actually need eight(!).
# Double every occurrence of \. There's eight backslashes on the right there!
>> puts '\\'.sub(/\\/, '\\\\\\\\')
anything I'm missing? any more efficient ways?
Clearing up a little confusion on the author's second line of code.
You said:
>> puts '\\ <- 2x a, because 2 backslashes get replaced'.sub(/\\/, 'aa')
# aa <- 2x a, because two backslashes get replaced
2 backslashes aren't getting replaced here. You're replacing 1 escaped backslash with two a's ('aa'). That is, if you used .sub(/\\/, 'a'), you would only see one 'a'
'\\'.sub(/\\/, 'anything') #=> anything
the pickaxe book mentions this exact problem, actually. here's another alternative (from page 130 of the latest edition)
str = 'a\b\c' # => "a\b\c"
str.gsub(/\\/) { '\\\\' } # => "a\\b\\c"

Resources