Ruby regex to split text

I am using the regex below to split text at certain sentence-ending punctuation, but it doesn't work with quotes.
text = "\"Hello my name is Kevin.\" How are you?"
text.scan(/\S.*?[.．.！!?？]/)
=> ["\"Hello my name is Kevin.", "\" How are you?"]
My goal is to produce the following result, but I am not very good with regular expressions. Any help would be greatly appreciated.
=> ["\"Hello my name is Kevin.\"", "How are you?"]

text.scan(/"(?>[^"\\]+|\\{2}|\\.)*"|\S.*?[.．.！!?？]/)
The idea is to check for quoted parts first. The subpattern is a bit more elaborate than a simple "[^"]*" so that it can deal with escaped quotes (* see the end for a more efficient pattern).
Pattern details:
" # literal: a double quote
(?> # open an atomic group: all that can be between quotes
[^"\\]+ # all that is not a quote or a backslash
| # OR
\\{2} # 2 backslashes (the idea is to skip even numbers of backslashes)
| # OR
\\. # an escaped character (in particular a double quote)
)* # repeat zero or more times the atomic group
" # literal double quote
| # OR
\S.*?[.．.！!?？]
To deal with single quotes too, you can add '(?>[^'\\]+|\\{2}|\\.)*'| to the pattern (the most efficient option), but if you want to make it shorter you can write this:
text.scan(/(['"])(?>[^'"\\]+|\\{2}|\\.|(?!\1)["'])*\1|\S.*?[.．.！!?？]/)
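where \1 is a backreference to the first capturing group (the quote that was found) and (?!\1) means "not followed by that same quote". Note that because this pattern contains a capturing group, scan returns the captured groups rather than the whole matches, so the whole match has to be collected another way.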
(*) Instead of writing "(?>[^"\\]+|\\{2}|\\.)*", you can use "[^"\\]*+(?:\\.[^"\\]*)*+", which is more efficient.
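As a rough sketch, here are the patterns above applied to the sample text from the question. The punctuation class is trimmed to ASCII here, which is an assumption made only to keep the example short, and the variable names are illustrative:

```ruby
text = "\"Hello my name is Kevin.\" How are you?"

# Quoted chunk first, then the original lazy "sentence" alternative.
# The punctuation class is cut down to ASCII for this sketch.
atomic = /"(?>[^"\\]+|\\{2}|\\.)*"|\S.*?[.!?]/
atomic_result = text.scan(atomic)

# Same idea with possessive quantifiers instead of an atomic group.
possessive = /"[^"\\]*+(?:\\.[^"\\]*)*+"|\S.*?[.!?]/
possessive_result = text.scan(possessive)

# The backreference variant has a capturing group, so scan would return
# the captures. gsub with no replacement and no block returns an
# enumerator over the whole matches instead.
either_quote = /(['"])(?>[^'"\\]+|\\{2}|\\.|(?!\1)["'])*\1|\S.*?[.!?]/
either_result = text.gsub(either_quote).to_a

p atomic_result, possessive_result, either_result
```

All three should give the two-element array asked for in the question.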

Add an optional quote (["']?) to the end of the pattern:
text.scan(/\S.*?[.．.！!?？]["']?/)
# => ["\"Hello my name is Kevin.\"", "How are you?"]

Related

Replacing ' by \' in Ruby [duplicate]

s = "#main= 'quotes'"
s.gsub "'", "\\'" # => "#main= quotes'quotes"
This seems wrong; I expected to get "#main= \\'quotes\\'".
When I don't use an escape character, it works as expected:
s.gsub "'", "*" # => "#main= *quotes*"
So it must have something to do with escaping.
Using ruby 1.9.2p290.
I need to replace each single quote with a backslash and a quote.
Even more inconsistencies:
"\\'".length # => 2
"\\*".length # => 2
# As expected
"'".gsub("'", "\\*").length # => 2
"'a'".gsub("'", "\\*") # => "\\*a\\*" (length==5)
# WTF next:
"'".gsub("'", "\\'").length # => 0
# Doubling the content?
"'a'".gsub("'", "\\'") # => "a'a" (length==3)
What is going on here?

You're getting tripped up by the specialness of \' inside a regular expression replacement string:
\0, \1, \2, ... \9, \&, \`, \', \+
Substitutes the value matched by the nth grouped subexpression, or by the entire match, pre- or postmatch, or the highest group.
So when you say "\\'", the double \\ becomes just a single backslash and the result is \' but that means "The string to the right of the last successful match." If you want to replace single quotes with escaped single quotes, you need to escape more to get past the specialness of \':
s.gsub("'", "\\\\'")
Or avoid the toothpicks and use the block form:
s.gsub("'") { |m| '\\' + m }
You would run into similar issues if you were trying to escape backticks, a plus sign, or even a single digit.
The overall lesson here is to prefer the block form of gsub for anything but the most trivial of substitutions.
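A minimal sketch of the three cases discussed above, using the sample string from the question (the variable names are only for illustration):

```ruby
s = "#main= 'quotes'"

# Replacement is \' : sub/gsub expand it to the text after the match.
post_match = s.gsub("'", "\\'")

# Replacement is \\' : an escaped backslash followed by a plain quote.
escaped = s.gsub("'", "\\\\'")

# Block form: the return value is used verbatim, nothing is expanded.
block_form = s.gsub("'") { |m| '\\' + m }

puts post_match  # #main= quotes'quotes
puts escaped     # #main= \'quotes\'
puts block_form  # #main= \'quotes\'
```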

s = "#main = 'quotes'"
s.gsub "'", "\\\\'"
Since \\ in a string literal stands for a single \, you have to write four backslashes to get two.

You need to escape the \ as well:
s.gsub "'", "\\\\'"
Outputs
"#main= \\'quotes\\'"
A good explanation found on an outside forum:
The key point to understand IMHO is that a backslash is special in
replacement strings. So, whenever one wants to have a literal
backslash in a replacement string one needs to escape it and hence
have [two] backslashes. Coincidentally a backslash is also special in a
string (even in a single quoted string). So you need two levels of
escaping, makes 2 * 2 = 4 backslashes on the screen for one literal
replacement backslash.
source

How to understand ruby gsub "\\\\" return "\" [duplicate]

I don't understand this Ruby code:
>> puts '\\ <- single backslash'
# \ <- single backslash
>> puts '\\ <- 2x a, because 2 backslashes get replaced'.sub(/\\/, 'aa')
# aa <- 2x a, because two backslashes get replaced
So far, all as expected. But if we search for 1 backslash with /\\/ and replace it with 2, encoded as '\\\\', why do we get this:
>> puts '\\ <- only 1 ... replace 1 with 2'.sub(/\\/, '\\\\')
# \ <- only 1 backslash, even though we replace 1 with 2
And then, when we encode 3 with '\\\\\\', we only get 2:
>> puts '\\ <- only 2 ... 1 with 3'.sub(/\\/, '\\\\\\')
# \\ <- 2 backslashes, even though we replace 1 with 3
Is anyone able to explain why a backslash gets swallowed in the replacement string? This happens on 1.8 and 1.9.

Quick Answer
If you want to sidestep all this confusion, use the much less confusing block syntax. Here is an example that replaces each backslash with 2 backslashes:
"some\\path".gsub('\\') { '\\\\' }
Gruesome Details
The problem is that when using sub (and gsub), without a block, ruby interprets special character sequences in the replacement parameter. Unfortunately, sub uses the backslash as the escape character for these:
\& (the entire match)
\+ (the last group)
\` (pre-match string)
\' (post-match string)
\0 (same as \&)
\1 (first captured group)
\2 (second captured group)
\\ (a backslash)
Like any escaping, this creates an obvious problem. If you want to include the literal value of one of the above sequences (e.g. \1) in the output string, you have to escape it. So, to get Hello \1, the replacement string needs to be Hello \\1. And to represent this as a string literal in Ruby, you have to escape those backslashes again, like this: "Hello \\\\1"
So, there are two different escaping passes. The first one takes the string literal and creates the internal string value. The second takes that internal string value and replaces the sequences above with the matching data.
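The two passes can be sketched with the Hello example. The input string "abc" and the pattern /(b)/ are made up for illustration:

```ruby
# Pass 1: Ruby parses the literal. Four backslashes in the source
# become two in the string, so this is: Hello \\1
escaped_ref = "Hello \\\\1"

# Pass 2: sub reads \\ as one literal backslash and leaves the 1 alone.
literal_out = "abc".sub(/(b)/, escaped_ref)

# With only two backslashes in the source the string is: Hello \1
# and sub expands \1 to the first captured group.
group_out = "abc".sub(/(b)/, "Hello \\1")

puts literal_out  # aHello \1c
puts group_out    # aHello bc
```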
If a backslash is not followed by a character that matches one of the above sequences, then the backslash (and the character that follows) passes through unaltered. This also affects a backslash at the end of the string: it passes through unaltered. It's easiest to see this logic in the Rubinius code; just look for the to_sub_replacement method in the String class.
Here are some examples of how String#sub is parsing the replacement string:
1 backslash \ (which has a string literal of "\\")
Passes through unaltered because the backslash is at the end of the string and has no characters after it.
Result: \
2 backslashes \\ (which have a string literal of "\\\\")
The pair of backslashes matches the escaped backslash sequence (see \\ above) and gets converted into a single backslash.
Result: \
3 backslashes \\\ (which have a string literal of "\\\\\\")
The first two backslashes match the \\ sequence and get converted to a single backslash. Then the final backslash is at the end of the string so it passes through unaltered.
Result: \\
4 backslashes \\\\ (which have a string literal of "\\\\\\\\")
Two pairs of backslashes each match the \\ sequence and get converted to a single backslash.
Result: \\
2 backslashes with character in the middle \a\ (which have a string literal of "\\a\\")
The \a does not match any of the escape sequences so it is allowed to pass through unaltered. The trailing backslash is also allowed through.
Result: \a\
Note: The same result could be obtained from: \\a\\ (with the literal string: "\\\\a\\\\")
In hindsight, this could have been less confusing if String#sub had used a different escape character. Then there wouldn't be the need to double escape all the backslashes.
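The cases listed above can be checked with a small table. This is only a sketch: the subject string "x" and the hash name are arbitrary, and each key is written as the double-quoted literal given in the list:

```ruby
# Keys are the replacement strings passed to sub; values are the
# output described in the list above.
expected = {
  "\\"         => "\\",     # 1 backslash   -> 1 backslash
  "\\\\"       => "\\",     # 2 backslashes -> 1 backslash
  "\\\\\\"     => "\\\\",   # 3 backslashes -> 2 backslashes
  "\\\\\\\\"   => "\\\\",   # 4 backslashes -> 2 backslashes
  "\\a\\"      => "\\a\\",  # \a\   -> \a\
  "\\\\a\\\\"  => "\\a\\",  # \\a\\ -> \a\
}

expected.each do |replacement, output|
  actual = "x".sub(/x/, replacement)
  status = actual == output ? "ok" : "MISMATCH"
  puts "#{replacement.inspect} -> #{actual.inspect} (#{status})"
end
```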

This is an issue because the backslash (\) serves as an escape character in both Regexps and Strings. You could use the special sequence \& to reduce the number of backslashes in the gsub replacement string.
foo.gsub(/\\/,'\&\&\&') # for some string foo, replace each \ with \\\
EDIT: I should mention that the value of \& is from a Regexp match, in this case a single backslash.
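A quick check of the \& approach, with a made-up value for foo:

```ruby
foo = 'a\b'  # a, one backslash, b

# In a single-quoted literal \& stays as backslash + ampersand, and
# gsub expands each \& to the matched text (here a single backslash).
tripled = foo.gsub(/\\/, '\&\&\&')

puts tripled  # a\\\b
```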
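Also, I thought that there was a special way to create a string that disabled the escape character. None of the forms below prints two backslashes; the one form that does, not listed here, is a single-quoted heredoc (<<'EOF'), which leaves backslashes untouched: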
puts "\\"
puts '\\'
puts %q{\\}
puts %Q{\\}
puts """\\"""
puts '''\\'''
puts <<EOF
\\
EOF
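One form missing from the list above is the single-quoted heredoc, which does no escape processing. A small sketch comparing it with the plain heredoc (variable names are illustrative):

```ruby
# Plain heredoc: behaves like a double-quoted string, so \\ becomes \.
interpolating = <<EOF
\\
EOF

# Single-quoted heredoc: the body is taken literally.
raw = <<'EOF'
\\
EOF

puts interpolating  # \
puts raw            # \\
```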

Argh, right after I typed all this out, I realised that \ is used to refer to groups in the replacement string. I guess this means that you need a literal \\ in the replacement string to get one \ in the output. To get a literal \\ you need four \s in the source, so to replace one with two you actually need eight(!).
# Double every occurrence of \. There's eight backslashes on the right there!
>> puts '\\'.sub(/\\/, '\\\\\\\\')
Anything I'm missing? Any more efficient ways?

Clearing up a little confusion on the author's second line of code.
You said:
>> puts '\\ <- 2x a, because 2 backslashes get replaced'.sub(/\\/, 'aa')
# aa <- 2x a, because two backslashes get replaced
2 backslashes aren't getting replaced here. You're replacing 1 escaped backslash with two a's ('aa'). That is, if you used .sub(/\\/, 'a'), you would only see one 'a'
'\\'.sub(/\\/, 'anything') #=> anything

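The Pickaxe book mentions this exact problem, actually. Here's another alternative (from page 130 of the latest edition); note that the comments show the backslashes as they print, not doubled as inspect would display them: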
str = 'a\b\c' # => "a\b\c"
str.gsub(/\\/) { '\\\\' } # => "a\\b\\c"
