Regular Expression lookahead / lookback for punctuation patterns - ruby

Using Ruby, I want to find a regular expression that correctly identifies sentence boundaries, which I am defining as any string that ends in [.!?] except when these punctuation marks exist within quotation marks, as in
My friend said "John isn't here!" and then he left.
My current code that is falling short is:
text = para.text.scan(/[^\.!?]+[(?<!(.?!)\"|.!?] /).map(&:strip)
I've pored over the regex docs, but still can't seem to understand lookbacks/lookaheads correctly.

How about something like this?
/(?:"(?>[^"]|\\.)+"|[a-z]\.[a-z]\.|[^.?!])+[!.?]/gi
Demo: https://regex101.com/r/bJ8hM5/2
How it works:
At each position in the string, the regex checks for the following:
A quoted string in the form "quote", which can contain anything up to the closing quote. You can also have escaped quotes, such as "hell\"o".
Any letter, followed by a dot, followed by another letter and a final dot. This covers your special case of U.S. etc.
Anything else that isn't one of the punctuation characters .?!.
These are repeated until a punctuation character is reached.
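In Ruby the pattern can be tried with String#scan, which already returns every match, so the g flag from the regex101 notation is dropped (Ruby has no such flag). A minimal sketch; the sample text and the variable names are assumptions made for illustration only:

```ruby
# Pattern from the answer above, minus the "g" flag (String#scan finds all matches).
sentence_re = /(?:"(?>[^"]|\\.)+"|[a-z]\.[a-z]\.|[^.?!])+[!.?]/i

# Illustrative sample text (an assumption, not taken from the original post).
text = %q(My friend said "John isn't here!" and then he left. He lives in the U.S. now. Really?)

# Each match may start with the space left over from the previous sentence, hence the strip.
sentences = text.scan(sentence_re).map(&:strip)
puts sentences
```

The "!" inside the quotation does not end the first sentence, and the dots in "U.S." are consumed by the abbreviation branch rather than treated as terminators.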

Here's a partial-regex solution that disregards sentence terminators that are contained between double-quotes.
Code
def extract_sentences(str, da_terminators)
start_with_quote = (str[0] == '"')
str.split(/(\".*?\")/)
.flat_map.with_index { |b,i|
(start_with_quote == i.even?) ? b : b.split(/([#{da_terminators}])/) }
.slice_after(/^[#{da_terminators}]$/)
.map { |sb| sb.join.strip }
end
Example
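str = %q(My friend said "John isn't here!", then "I'm outta' here" and then he left. Let's go! Later, he said "Aren't you coming?")
puts extract_sentences(str, '!?.')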
# My friend said "John isn't here!", then "I'm outta' here" and then he left.
# Let's go!
# Later, he said "Aren't you coming?"
Explanation
For str above and
da_terminators = '!?.'
We will need the following later:
start_with_quote = (str[0] == '"')
#=> false
Split the string on "...". We need to make \".*?\" a capture group in order to keep the quoted pieces in the result. The result is an array, blocks, that alternates between strings surrounded by double quotes and other strings. start_with_quote tells us which is which.
blocks = str.split(/(\".*?\")/)
#=> ["My friend said ",
# "\"John isn't here!\"",
# ", then ",
# "\"I'm outta' here\"",
# " and then he left. Let's go! Later, he said ",
# "\"Aren't you coming?\""]
Split the string elements that are not surrounded by double quotes. The split is on any of the sentence terminating characters. Again it must be in a capture group in order to keep the separator.
new_blocks = blocks.flat_map.with_index { |b,i|
(start_with_quote == i.even?) ? b : b.split(/([#{da_terminators}])/) }
#=> ["My friend said ",
# "\"John isn't here!\"",
# ", then ",
# "\"I'm outta' here\"",
# " and then he left",
# ".",
# " Let's go",
# "!",
# " Later, he said ",
# "\"Aren't you coming?\""]
sentence_blocks_enum = new_blocks.slice_after(/^[#{da_terminators}]$/)
# #<Enumerator:0x007f9a3b853478>
Convert this enumerator to an array to see what it will pass into its block:
sentence_blocks_enum.to_a
#=> [["My friend said ",
# "\"John isn't here!\"",
# ", then ",
# "\"I'm outta' here\"",
# " and then he left", "."],
# [" Let's go", "!"],
# [" Later, he said ", "\"Aren't you coming?\""]]
Combine the blocks of each sentence and strip whitespace, and return the array:
sentence_blocks_enum.map { |sb| sb.join.strip }
#=> ["My friend said \"John isn't here!\", then \"I'm outta' here\" and then he left.",
# "Let's go!",
# "Later, he said \"Aren't you coming?\""]
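One case worth checking separately is a string that begins with a quoted passage. String#split keeps a leading empty field, so after splitting on a captured "..." pattern the quoted pieces always sit at odd indices. A variant sketch that relies on that index parity; the method name extract_sentences_by_index and the use of Regexp.escape are assumptions of this sketch, not part of the answer above (Enumerable#slice_after needs Ruby 2.2 or later):

```ruby
# Variant sketch: decide "quoted or not" purely from the index parity of the split result.
def extract_sentences_by_index(str, terminators)
  chars = Regexp.escape(terminators)            # safe to embed in a character class
  str.split(/(".*?")/)                          # captured quotes land at odd indices
     .flat_map.with_index { |piece, i| i.odd? ? [piece] : piece.split(/([#{chars}])/) }
     .slice_after(/\A[#{chars}]\z/)             # a lone terminator closes a sentence
     .map { |parts| parts.join.strip }
     .reject(&:empty?)
end

puts extract_sentences_by_index(%q("Stop!" he said. Then he left.), '!?.')
```

Here the leading "Stop!" stays attached to its sentence, because only the even-indexed, unquoted pieces are split on terminators.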

Related

Split string into chunks of maximum character count without breaking words

I want to split a string into chunks, each within a maximum character count (say 2000), without splitting any word.
I have tried doing as below:
text.chars.each_slice(2000).map(&:join)
but sometimes, words are split. I have tried some regex:
text.scan(/.{1,2000}\b|.{1,2000}/).map(&:strip)
from this question, but I don't quite get how it works and it gives me some erratic behavior, sometimes giving chunks that only contain periods.
Any pointers will be greatly appreciated.
Code
def max_groups(str, n)
arr = []
pos = 0
loop do
break arr if pos == str.size
m = str.match(/.{1,#{n}}(?=[ ]|\z)|.{,#{n-1}}[ ]/, pos)
return nil if m.nil?
arr << m[0]
pos += m[0].size
end
end
Examples
str = "Now is the time for all good people to party"
#      12345678901234567890123456789012345678901234
#      0        1         2         3         4
max_groups(str, 5)
#=> nil
max_groups(str, 6)
#=> ["Now is", " the ", "time ", "for ", "all ", "good ", "people", " to
max_groups(str, 10)
#=> ["Now is the", " time for ", "all good ", "people to ", "party"]
max_groups(str, 14)
#=> ["Now is the ", "time for all ", "good people to", " party"]
max_groups(str, 15)
#=> ["Now is the time", " for all good ", "people to party"]
max_groups(str, 29)
#=> ["Now is the time for all good ", "people to party"]
max_groups(str, 43)
#=> ["Now is the time for all good people to ", "party"]
max_groups(str, 44)
#=> ["Now is the time for all good people to party"]
str = "How you do?"
# 123456789012345678
# 0 1
max_groups(str, 4)
#=> ["How ", " ", " ", "you ", "do?"]
You could do a Notepad style word wrap.
Just construct the regex using the maximum characters per line quantifier range {1,N}.
The example below uses 32 max per line.
https://regex101.com/r/8vAkOX/1
Update: To include linebreaks within the range, add the dot-all modifier (?s)
Otherwise, stand alone linebreaks are filtered.
(?s)(?:((?>.{1,32}(?:(?<=[^\S\r\n])[^\S\r\n]?|(?=\r?\n)|$|[^\S\r\n]))|.{1,32})(?:\r?\n)?|(?:\r?\n|$))
The chunks are in $1, and you could replace with $1\r\n to get a display
that looks wrapped.
Explained
(?s) # Span line breaks
(?:
# -- Words/Characters
( # (1 start)
(?> # Atomic Group - Match words with valid breaks
.{1,32} # 1-N characters
# Followed by one of 4 prioritized, non-linebreak whitespace
(?: # break types:
(?<= [^\S\r\n] ) # 1. - Behind a non-linebreak whitespace
[^\S\r\n]? # ( optionally accept an extra non-linebreak whitespace )
| (?= \r? \n ) # 2. - Ahead a linebreak
| $ # 3. - EOS
| [^\S\r\n] # 4. - Accept an extra non-linebreak whitespace
)
) # End atomic group
|
.{1,32} # No valid word breaks, just break on the N'th character
) # (1 end)
(?: \r? \n )? # Optional linebreak after Words/Characters
|
# -- Or, Linebreak
(?: \r? \n | $ ) # Stand alone linebreak or at EOS
)
This is what worked for me (thanks to @StefanPochmann's comments):
text = "Some really long string\nwith some line breaks"
The following first collapses every run of whitespace (including line breaks) into a single space before breaking the string up.
text.gsub(/\s+/, ' ').scan(/.{1,2000}(?: |$)/).map(&:strip)
The resulting chunks of strings will lose all the line breaks (\n) from the original string. If you need to maintain the line breaks, you need to replace them all with some random placeholder (before applying the regex), for example: (br), that you can use to restore the line breaks later. Like this:
text = "Some really long string\nwith some line breaks".gsub("\n", "(br)")
After we run the regex, we can restore the line breaks for the new chunks by replacing all occurrences of (br) with \n like this:
chunks = text.gsub(/\s+/, ' ').scan(/.{1,2000}(?: |$)/).map(&:strip)
chunks.each{|chunk| chunk.gsub!('(br)', "\n")}
Looks like a long process but it worked for me.
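The same chunking can also be sketched without a regex, by packing whitespace-separated words greedily. This is only an illustration under stated assumptions: chunk_words is a hypothetical helper name, runs of whitespace are collapsed to single spaces, and a word longer than the limit raises an error instead of being split.

```ruby
# Hypothetical helper: pack words into chunks of at most max_len characters.
def chunk_words(text, max_len)
  text.split.each_with_object([]) do |word, chunks|
    raise ArgumentError, "word longer than #{max_len} characters: #{word}" if word.length > max_len

    if chunks.empty? || chunks.last.length + 1 + word.length > max_len
      chunks << word.dup            # start a new chunk
    else
      chunks.last << " " << word    # extend the current chunk with a space and the word
    end
  end
end

sample = "Now is the time for all good people to party"
chunks = chunk_words(sample, 10)
puts chunks
```

With a limit of 10 this gives the same chunks as sample.scan(/.{1,10}(?: |$)/).map(&:strip), which makes it a convenient cross-check for the regex version on small inputs.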

How to replace Perl-style regex with MatchData object

I am using the gsub method with a regular expression:
@text.gsub(/(-\n)(\S+)\s/) { "#{$2}\n" }
Example of input data:
"The wolverine is now es-
sentially absent from
the southern end
of its European range."
should return:
"The wolverine is now essentially
absent from
the southern end
of its European range."
The method works fine, but RuboCop reports an offense:
Avoid the use of Perl-style backrefs.
Any ideas how to rewrite it using MatchData object instead of $2?
If you want to use Regexp.last_match :
@text.gsub(/(-\n)(\S+)\s/) { Regexp.last_match[2] + "\n" }
or :
@text.gsub(/-\n(\S+)\s/) { Regexp.last_match[1] + "\n" }
Note that the block in gsub should be used when logic is involved. Without logic, a second parameter set to "\\1\n" or '\1' + "\n" would do just fine.
You can use backslash without the block:
@text.gsub(/(-\n)(\S+)\s/, "\\2\n")
Also, it's a bit cleaner to use only one group, since the first one above isn't needed:
@text.gsub(/-\n(\S+)\s/, "\\1\n")
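A small runnable comparison of these forms on the sample input from the question. The local variable text stands in for the instance variable, and the named group rest is an assumption added here to show a third option:

```ruby
text = "The wolverine is now es-\nsentially absent from\nthe southern end\nof its European range."

# Block form, reading the capture from the MatchData returned by Regexp.last_match.
with_last_match = text.gsub(/-\n(\S+)\s/) { "#{Regexp.last_match(1)}\n" }

# Replacement-string form with a backslash backreference.
with_backslash = text.gsub(/-\n(\S+)\s/, "\\1\n")

# Block form with a named capture group.
with_named = text.gsub(/-\n(?<rest>\S+)\s/) { "#{Regexp.last_match[:rest]}\n" }

puts with_last_match
```

All three avoid the Perl-style $2 and rejoin "es-" and "sentially" while moving the line break after the completed word.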
This solution accounts for errant spaces before newlines and split words that end a sentence or the string. It uses String#gsub with a block and no capture groups.
Code
R = /
[[:alpha:]]\- # match a letter followed by a hyphen
\s*\n # match a newline possibly preceded by whitespace
[[:alpha:]]+ # match one or more letters
[.?!]? # possibly match a sentence terminator
\n? # possibly match a newline
\s* # match zero or more whitespaces
/x # free-spacing regex definition mode
def remove_hyphens(str)
str.gsub(R) { |s| s.gsub(/[\n\s-]/, '') << "\n" }
end
Examples
str =<<_
The wolverine is now es-
sentially absent from
the south-
ern end of its
European range.
_
puts remove_hyphens(str)
The wolverine is now essentially
absent from
the southern
end of its
European range.
puts remove_hyphens("now es- \nsentially\nabsent")
now essentially
absent
puts remove_hyphens("now es-\nsentially.\nabsent")
now essentially.
absent
remove_hyphens("now es-\nsentially?\n")
#=> "now essentially?\n" (no extra \n at end)

Extract all words with # symbol from a string

I need to extract all #usernames from a string (for Twitter) using Rails/Ruby:
String Examples:
"#tom #john how are you?"
"how are you #john?"
"#tom hi"
The function should extract all usernames from a string, leaving out any characters that are not allowed in usernames, such as the "?" in the second example.
From "Why can't I register certain usernames?":
A username can only contain alphanumeric characters (letters A-Z, numbers 0-9) with the exception of underscores, as noted above. Check to make sure your desired username doesn't contain any symbols, dashes, or spaces.
The \w metacharacter is equivalent to [a-zA-Z0-9_]:
/\w/ - A word character ([a-zA-Z0-9_])
Simply scanning for #\w+ will succeed according to that:
strings = [
"#tom #john how are you?",
"how are you #john?",
"#tom hi",
"#foo #_foo #foo_ #foo_bar #f123bar #f_123_bar"
]
strings.map { |s| s.scan(/#\w+/) }
# => [["#tom", "#john"],
# ["#john"],
# ["#tom"],
# ["#foo", "#_foo", "#foo_", "#foo_bar", "#f123bar", "#f_123_bar"]]
There are multiple ways to do it - here's one way:
string = "#tom #john how are you?"
words = string.split " "
twitter_handles = words.select do |word|
  word.length > 1 && word.start_with?('#') &&
    word[1..-1].chars.all? { |char| char =~ /[a-zA-Z0-9_]/ }
end
The char =~ regex will only accept alphanumerics and the underscore.
r = /
\# # match a literal "#" (escaped, because a bare # starts a comment in free-spacing mode)
[[:alpha:]]+ # match one or more letters
\b # match word break
/x # free-spacing regex definition mode
"#tom #john how are you? And you, #andré?".scan(r)
#=> ["#tom", "#john", "#andré"]
If you wish to instead return
["tom", "john", "andré"]
change the first line of the regex from \# to
(?<=\#)
which is a positive lookbehind. It requires that the character "#" be present but it will not be part of the match.
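For comparison, a short sketch of the same idea without free-spacing mode, using \w as in the earlier answer. The variable names are illustrative assumptions:

```ruby
tweet = "#tom #john how are you?"

with_marker    = tweet.scan(/#\w+/)          # keeps the leading "#"
without_marker = tweet.scan(/(?<=#)\w+/)     # lookbehind: "#" must precede but is not captured
via_group      = tweet.scan(/#(\w+)/).flatten # capture group: scan returns one array per match

p with_marker, without_marker, via_group
```

The lookbehind and the capture group give the same names here; the capture group version is often easier to read, while the lookbehind keeps scan returning a flat array of strings.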

Ruby regex eliminate new line until . or ? or capital letter

I'd like to do the following with my strings:
line1= "You have a house\nnext to the corner."
Eliminate the \n unless the line break comes after a period or question mark, or before a capital letter, so the desired output in this case is:
"You have a house next to the corner.\n"
Another example, this time with a question mark:
"You like baggy trousers,\ndon't you?"
should become:
"You like baggy trousers, don't you?\n".
I've tried:
line1.gsub!(/(?<!?|.)"\n"/, " ")
(?<!?|.) is meant to say that immediately before the \n there must NOT be either a question mark (?) or a period.
But I get the following syntax error:
SyntaxError: (eval):2: target of repeat operator is not specified: /(?<!?|.)"\n"/
And for the sentences where in the middle of them there's a capital letter, insert a \n before that capital letter so the sentence:
"We were winning The Home Secretary played a important role."
Should become:
"We were winning\nThe Home Secretary played a important role."
NOTE: This answer is not meant to provide a generic way to remove unnecessary newlines inside sentences; it only serves the OP's purpose of removing or inserting newlines in specific places in a string.
Since you need to replace matches in different scenarios differently, you should consider a 2-step approach.
.gsub(/(?<![?.])\n/, ' ')
This replaces every newline that is not preceded by ? or . ((?<![?.]) is a negative lookbehind that fails the match if the character just before the current location is ? or .).
The second step is
.sub(/(?<!^) *+(?=[A-Z])/, "\n")
or
.sub(/(?<!^) *+(?=\p{Lu})/, "\n")
It matches 0+ spaces ( *+) possessively (no backtracking into the space pattern) that are not at the beginning of a line (because of the (?<!^) negative lookbehind; replace ^ with \A to exclude only the start of the whole string) and that are followed by a capital letter ((?=\p{Lu}) is a positive lookahead that requires the pattern to appear immediately to the right of the current location).
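The two steps can be wrapped in small helpers and tried on the question's examples. The method names are hypothetical, and the inputs below carry a trailing newline as an assumption so that the outputs match the desired strings; the replacement is a double-quoted "\n" so that a real newline is inserted:

```ruby
# Step 1: join lines that were broken in the middle of a sentence.
def join_broken_lines(str)
  str.gsub(/(?<![?.])\n/, ' ')
end

# Step 2: start a new line before the first capital letter that is not at a line start.
def break_before_capital(str)
  str.sub(/(?<!^) *+(?=\p{Lu})/, "\n")
end

p join_broken_lines("You have a house\nnext to the corner.\n")
p join_broken_lines("You like baggy trousers,\ndon't you?\n")
p break_before_capital("We were winning The Home Secretary played a important role.")
```

Because step 2 uses sub rather than gsub, only the first qualifying capital letter gets a newline in front of it, which is what keeps "Home" and "Secretary" on the same line.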
You are nearly there. You need to a) escape both ? and . and b) remove quotation marks around \n in the expression:
line1= "You have a house\nnext to the corner.\nYes?\nNo."
line1.gsub!(/(?<!\?|\.)\s*\n\s*/, " ")
#⇒ "You have a house next to the corner.\nYes?\nNo."
As you want the trailing \n, just add it afterwards:
line1.gsub! /\Z/, "\n"
#⇒ "You have a house next to the corner.\nYes?\nNo.\n"
The simple way to do this is to replace all the embedded new-lines with a space, which effectively joins the line segments, then fix the line-end. It's not necessary to worry about the punctuation and it's not necessary to use (or maintain) a regex.
You can do this a lot of ways, but I'd use:
sentences = [
"foo\nbar",
"foo\n\nbar",
"foo\nbar\n",
]
sentences.map{ |s| s.gsub("\n", ' ').squeeze(' ').strip + "\n" }
# => ["foo bar\n", "foo bar\n", "foo bar\n"]
Here's what's happening inside the map block:
s # => "foo\nbar", "foo\n\nbar", "foo\nbar\n"
.gsub("\n", ' ') # => "foo bar", "foo  bar", "foo bar "
.squeeze(' ') # => "foo bar", "foo bar", "foo bar "
.strip # => "foo bar", "foo bar", "foo bar"
+ "\n"

ruby code for modifying outer quotes on strings?

Does anyone know of a Ruby gem (or built-in, or native syntax, for that matter) that operates on the outer quote marks of strings?
I find myself writing methods like this over and over again:
remove_outer_quotes_if_quoted( myString, chars ) -> aString
add_outer_quotes_unless_quoted( myString, char ) -> aString
The first tests myString to see if its beginning and ending characters match any one character in chars. If so, it returns the string with quotes removed. Otherwise it returns it unchanged. chars defaults to a list of quote mark characters.
The second tests myString to see if it already begins and ends with char. If so, it returns the string unchanged. If not, it returns the string with char tacked on before and after, and any embedded occurrence of char is escaped with a backslash. char defaults to the first in a default list of characters.
(My hand-cobbled methods don't have such verbose names, of course.)
I've looked around for similar methods in the public repos but can't find anything like this. Am I the only one that needs to do this a lot? If not, how does everyone else do this?
If you do it a lot, you may want to add a method to String:
class String
def strip_quotes
gsub(/\A['"]+|['"]+\Z/, "")
end
end
Then you can just call string.strip_quotes.
Adding quotes is similar:
class String
def add_quotes
%Q/"#{strip_quotes}"/
end
end
This is called as string.add_quotes and uses strip_quotes before adding double quotes.
This might 'splain how to remove and add them:
str1 = %["We're not in Kansas anymore."]
str2 = %['He said, "Time flies like an arrow, Fruit flies like a banana."']
puts str1
puts str2
puts
puts str1.sub(/\A['"]/, '').sub(/['"]\z/, '')
puts str2.sub(/\A['"]/, '').sub(/['"]\z/, '')
puts
str3 = "foo"
str4 = 'bar'
[str1, str2, str3, str4].each do |str|
puts (str[/\A['"]/] && str[/['"]\z/]) ? str : %Q{"#{str}"}
end
The original two lines:
# >> "We're not in Kansas anymore."
# >> 'He said, "Time flies like an arrow, Fruit flies like a banana."'
Stripping quotes:
# >> We're not in Kansas anymore.
# >> He said, "Time flies like an arrow, Fruit flies like a banana."
Adding quotes when needed:
# >> "We're not in Kansas anymore."
# >> 'He said, "Time flies like an arrow, Fruit flies like a banana."'
# >> "foo"
# >> "bar"
I would use value = value[1...-1] if value[0] == value[-1] && %w[' "].include?(value[0]). In short, this checks whether the first and last characters of the string are the same and removes them if they are single or double quotes. More quote characters can be added to the list as needed.
%w["adadasd" 'asdasdasd' 'asdasdasd"].each do |value|
puts 'Original value: ' + value
value = value[1...-1] if value[0] == value[-1] && %w[' "].include?(value[0])
puts 'Processed value: ' + value
end
The example above will print the following:
Original value: "adadasd"
Processed value: adadasd
Original value: 'asdasdasd'
Processed value: asdasdasd
Original value: 'asdasdasd"
Processed value: 'asdasdasd"
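Pulling the answers together, the two helpers described in the question could be sketched as plain methods. This is an illustration, not an existing gem API: the method names come from the question, while the default quote characters and the backslash escaping are assumptions based on its description.

```ruby
# Strip one pair of matching outer quotes, if present; otherwise return the string unchanged.
def remove_outer_quotes_if_quoted(str, chars = ['"', "'"])
  quoted = str.length >= 2 && str[0] == str[-1] && chars.include?(str[0])
  quoted ? str[1...-1] : str
end

# Wrap the string in char unless it is already wrapped; escape embedded occurrences of char.
def add_outer_quotes_unless_quoted(str, char = '"')
  return str if str.length >= 2 && str.start_with?(char) && str.end_with?(char)

  escaped = str.gsub(char) { "\\#{char}" }   # block form inserts the backslash literally
  "#{char}#{escaped}#{char}"
end

puts remove_outer_quotes_if_quoted(%q("adadasd"))
puts add_outer_quotes_unless_quoted(%q(say "hi"))
```

Unlike strip_quotes above, the first helper removes only a single matching pair, so mismatched quotes such as 'asdasdasd" are left alone.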

Resources