Ruby regex eliminate new line until . or ? or capital letter - ruby

I'd like to do the following with my strings:
line1= "You have a house\nnext to the corner."
Eliminate \n if the sentence doesn't finish in new line after dot or question mark or capital letter, so the desired output will be in this case:
"You have a house next to the corner.\n"
Another example, this time with a question mark:
"You like baggy trousers,\ndon't you?"
should become:
"You like baggy trousers, don't you?\n".
I've tried:
line1.gsub!(/(?<!?|.)"\n"/, " ")
(?<!?|.) is meant to say that immediately preceding \n there must NOT be either a question mark (?) or a period (.)
But I get the following syntax error:
SyntaxError: (eval):2: target of repeat operator is not specified: /(?<!?|.)"\n"/
And for the sentences where in the middle of them there's a capital letter, insert a \n before that capital letter so the sentence:
"We were winning The Home Secretary played a important role."
Should become:
"We were winning\nThe Home Secretary played a important role."

NOTE: This answer is not meant to provide a generic way to remove unnecessary newlines inside sentences; it is only meant to serve the OP's purpose of removing or inserting newlines at specific places in a string.
Since you need to replace matches differently in different scenarios, consider a 2-step approach. The first step is
.gsub(/(?<![?.])\n/, ' ')
This replaces every newline that is not preceded by ? or . ((?<![?.]) is a negative lookbehind that fails the match if the character just before the current position is ? or .).
The second step is
.sub(/(?<!^) *+(?=[A-Z])/, "\n")
or
.sub(/(?<!^) *+(?=\p{Lu})/, "\n")
It matches zero or more spaces ( *+, possessively, so there is no backtracking into the space pattern) that are not at the beginning of a line (due to the (?<!^) negative lookbehind; replace ^ with \A to exclude only the start of the whole string) and that are followed by a capital letter ((?=\p{Lu}) is a positive lookahead that requires an uppercase letter immediately to the right of the current position). The replacement must be the double-quoted "\n": in single quotes, '\n' is a literal backslash followed by n, not a newline.
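The two steps can be tried out with a minimal sketch like the one below. The method names join_broken_lines and break_before_capital are made up for illustration and are not part of the answer; the replacement is written as a double-quoted "\n" so that a real newline is inserted.

```ruby
# Step 1: replace a newline with a space unless it follows "?" or ".".
def join_broken_lines(text)
  text.gsub(/(?<![?.])\n/, ' ')
end

# Step 2: replace the first run of spaces that precedes a capital letter
# (but is not at the start of a line) with a newline.
def break_before_capital(text)
  text.sub(/(?<!^) *+(?=\p{Lu})/, "\n")
end

p join_broken_lines("You have a house\nnext to the corner.")
# => "You have a house next to the corner."
p break_before_capital("We were winning The Home Secretary played a important role.")
# => "We were winning\nThe Home Secretary played a important role."
```

Note that sub only touches the first match, so later capitals such as "Home" are left alone.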

You are nearly there. You need to a) escape both ? and . and b) remove quotation marks around \n in the expression:
line1= "You have a house\nnext to the corner.\nYes?\nNo."
line1.gsub!(/(?<!\?|\.)\s*\n\s*/, " ")
#⇒ "You have a house next to the corner.\nYes?\nNo."
As you want the trailing \n, just add it afterwards:
line1.gsub!(/\Z/, "\n")
#⇒ "You have a house next to the corner.\nYes?\nNo.\n"

The simple way to do this is to replace all the embedded new-lines with a space, which effectively joins the line segments, then fix the line-end. It's not necessary to worry about the punctuation and it's not necessary to use (or maintain) a regex.
You can do this a lot of ways, but I'd use:
sentences = [
"foo\nbar",
"foo\n\nbar",
"foo\nbar\n",
]
sentences.map{ |s| s.gsub("\n", ' ').squeeze(' ').strip + "\n" }
# => ["foo bar\n", "foo bar\n", "foo bar\n"]
Here's what's happening inside the map block:
s                  # => "foo\nbar", "foo\n\nbar", "foo\nbar\n"
  .gsub("\n", ' ') # => "foo bar", "foo  bar", "foo bar "
  .squeeze(' ')    # => "foo bar", "foo bar", "foo bar "
  .strip           # => "foo bar", "foo bar", "foo bar"
  + "\n"

Related

Delete all the whitespaces that occur after a word in ruby

I have a string " hello world! How is it going?"
The output I need is " helloworld!Howisitgoing?"
So all the whitespaces after hello should be removed. I am trying to do this in ruby using regex.
I tried strip and delete(' ') methods but I didn't get what I wanted.
some_string = " hello world! How is it going?"
some_string.delete(' ') #deletes all spaces
some_string.strip #removes trailing and leading spaces only
Please help. Thanks in advance!
There are numerous ways this could be accomplished without a regular expression, but using one could be the "cleanest" looking approach, without taking sub-strings, etc. The regular expression I believe you are looking for is /(?!^)(\s)/.
" hello world! How is it going?".gsub(/(?!^)(\s)/, '')
#=> " helloworld!Howisitgoing?"
The \s matches any whitespace character (including tabs, etc.), and ^ is an "anchor" for the beginning of a line (here, the beginning of the string). The (?!...) construct is a negative lookahead: it rejects a match at any position where the enclosed pattern would match. Used together, they match every whitespace character except one at the very start of the string.
If you are not familiar with gsub, it replaces every occurrence of a pattern, which can be a regular expression. It additionally has a gsub! counterpart that mutates the string in place instead of creating a new altered copy.
Note that strictly speaking, this isn't all whitespace "after a word" to quote the exact question, but I gathered from your examples that your intentions were "all whitespace except beginning of string", which this will do.
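One caveat: the lookahead only protects position 0, so a string with several leading spaces keeps just the first one. If the whole leading run should survive, one possible sketch splits it off first (the helper name strip_inner_whitespace is made up for illustration):

```ruby
# Keep the leading whitespace run intact and delete all whitespace after it.
def strip_inner_whitespace(str)
  leading = str[/\A\s*/]          # leading whitespace (possibly empty, never nil)
  rest = str[leading.length..-1]  # everything after the leading run
  leading + rest.gsub(/\s+/, '')
end

p strip_inner_whitespace("   hello    world! How is it going?")
# => "   helloworld!Howisitgoing?"
```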
def remove_spaces_after_word(str, word)
  i = str.index(/\b#{word}\b/i)
  return str if i.nil?
  i += word.size
  str.gsub(/ /) { Regexp.last_match.begin(0) >= i ? '' : ' ' }
end
remove_spaces_after_word("Hey hello world! How is it going?", "hello")
#=> "Hey helloworld!Howisitgoing?"

How do I apply gsub subject to a function?

I'm using Rails 5 and Ruby 2.4. I have a function
my_function(str1, str2)
that will return true or false given two string arguments. What I would like to do is given a larger string, for instance
"a b c d"
I would like to replace two consecutive "words" (a word by my definition is a sequence of characters followed by a word boundary) with the empty string if the expression
my_function(str1, str2)
evaluates to true for those two consecutive words. So for instance, if
my_function("b", "c")
evaluates to true, I would like the above string to become
"a d"
How do I do this?
Edit: I'm including the output based on Tom Lord's answer ...
If I use
def stuff(line)
  matches = line.scan(/\b((\S+?)\b.*?\b(\S+?))\b/)
  matches.each do |full_match, word1, word2|
    line.delete!(full_match) if word1.eql?("hello") && word2.eql?("world")
  end
end
and line is
"hello world this is a test"
the resulting string line is
"tisisatst"
This is not quite what I expected. The result should be
" this is a test"
Edit: This is an updated answer, based on the comments below. I have left my original answer at the bottom.
Scanning a string for "two consecutive words" is a bit tricky. Your best option is probably to use the \b anchor in a regex, which signifies a "word boundary":
string_to_change = "a b c d"
matches = string_to_change.scan(/\b((\S+?)\b.*?\b(\S+?))\b/)
# => [["a b", "a", "b"], ["c d", "c", "d"]]
...Where the first string is the "full match" (including any whitespace or punctuation), the others are the two words.
To break down that regex:
\b means "word boundary". I have placed one on each side of both strings. This solution assumes that str1 and str2 are both a single word. (If they contain spaces, then I don't know what behaviour you expect?)
\S+? means "one or more non-whitespace characters". (Matching non-greedily, so it will stop matching at the first word boundary).
You can then remove each "full match" from the string, if the method returns true for the two words:
matches.each do |full_match, word1, word2|
  string_to_change.gsub!(full_match, '') if my_function(word1, word2)
end
One thing that's not accounted for here (you didn't specify this well in your question...) was how to handle strings containing three or more words. For example, consider the following:
"hello world this is a test"
Suppose my_function(word1, word2) returns true only for the pairs: "world", "this" and "hello", "is".
My code above will only look at the pairs: "hello", "world", "this", "is" and "a", "test". But perhaps it should actually:
Look at all pairs of words, i.e. match all words with the left- and right- hand side.
Delete pairs of words repeatedly, i.e. after the initial pair: "world this" is removed, the string should be re-scanned and then "hello is" should also be removed?
If such further enhancements are needed, then please explain them clearly in a new question (if you are struggling to solve the problem yourself).
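For the first of those enhancements, a rough sketch under stated assumptions is shown below. It splits on whitespace, checks every adjacent pair from left to right, and drops both words when the predicate is true. The method name remove_adjacent_pairs and the block standing in for my_function are hypothetical, and the result is re-joined with single spaces, so the original spacing is not preserved. It does not re-scan after a deletion.

```ruby
# Remove each adjacent pair of words for which the block returns true.
# The block plays the role of my_function(word1, word2).
def remove_adjacent_pairs(text)
  words = text.split
  kept = []
  i = 0
  while i < words.length
    if i + 1 < words.length && yield(words[i], words[i + 1])
      i += 2            # skip both words of the matching pair
    else
      kept << words[i]  # keep this word and move on to the next pair
      i += 1
    end
  end
  kept.join(' ')
end

p remove_adjacent_pairs("a b c d") { |w1, w2| w1 == "b" && w2 == "c" }
# => "a d"
```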
Original answer:
str1 = "b"
str2 = "c"
string_to_change = "a b c d"
if my_function(str1, str2)
  string_to_change.gsub!(/\b#{str1}\b\s+\b#{str2}\b/, "")
end
To break down that regex:
\b means "word boundary". I have placed one on each side of both strings. This solution assumes that str1 and str2 are both a single word. (If they contain spaces, then I don't know what behaviour you expect?)
\s+ means "one or more whitespace characters". You may wish to tweak this to allow other punctuation too, such as a comma or full stop. A fully generic solution to this issue could in fact be:
string_to_change.gsub!(/\b#{str1}\b.(\B.)*#{str2}\b/, "")
# Or equivalently:
string_to_change.gsub!(/\b#{str1}\b(.\B)*.#{str2}\b/, "")
.(\B.)* instead collects one character at a time, always checking that it is not the first letter of a word (i.e. that it is preceded by a non-word boundary).
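One thing to watch when interpolating str1 and str2 into a regex: any metacharacters they contain (such as .) are interpreted as pattern syntax. A minimal sketch that guards against this with Regexp.escape is shown below; the method name remove_pair is an assumption for illustration, and the words are assumed to start and end with word characters so that \b still applies.

```ruby
# Remove "first second" (separated by whitespace) from text,
# treating both words as literal text rather than as regex syntax.
def remove_pair(text, first, second)
  pattern = /\b#{Regexp.escape(first)}\b\s+\b#{Regexp.escape(second)}\b/
  text.gsub(pattern, '')
end

p remove_pair("a b c d", "b", "c")   # => "a  d" (the surrounding spaces remain)
p remove_pair("axb c", "a.b", "c")   # => "axb c" (the dot is literal, so no match)
```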

How to replace Perl-style regex with MatchData object

I am using the gsub method with a regular expression:
@text.gsub(/(-\n)(\S+)\s/) { "#{$2}\n" }
Example of input data:
"The wolverine is now es-
sentially absent from
the southern end
of its European range."
should return:
"The wolverine is now essentially
absent from
the southern end
of its European range."
The method works fine, but rubocop reports an offense:
Avoid the use of Perl-style backrefs.
Any ideas how to rewrite it using MatchData object instead of $2?
If you want to use Regexp.last_match:
@text.gsub(/(-\n)(\S+)\s/) { Regexp.last_match[2] + "\n" }
or:
@text.gsub(/-\n(\S+)\s/) { Regexp.last_match[1] + "\n" }
Note that the block in gsub should be used when logic is involved. Without logic, a second parameter set to "\\1\n" or '\1' + "\n" would do just fine.
You can use a backslash reference without the block:
@text.gsub /(-\n)(\S+)\s/, "\\2\n"
Also, it's a bit cleaner to use only one group, since the first one above isn't needed:
@text.gsub /-\n(\S+)\s/, "\\1\n"
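To compare the variants side by side, here is a small self-contained sketch run against the sample text from the question. The method names are made up for illustration; Regexp.last_match(1) is the form that takes the group number as an argument.

```ruby
# Block form, using the MatchData-based accessor instead of $1.
def join_hyphenated_block(text)
  text.gsub(/-\n(\S+)\s/) { "#{Regexp.last_match(1)}\n" }
end

# Replacement-string form, using a \1 backreference.
def join_hyphenated_string(text)
  text.gsub(/-\n(\S+)\s/, "\\1\n")
end

sample = "The wolverine is now es-\nsentially absent from\n" \
         "the southern end\nof its European range."

puts join_hyphenated_block(sample)
puts join_hyphenated_block(sample) == join_hyphenated_string(sample)  # prints true
```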
This solution accounts for errant spaces before newlines and split words that end a sentence or the string. It uses String#gsub with a block and no capture groups.
Code
R = /
  [[:alpha:]]\-  # match a letter followed by a hyphen
  \s*\n          # match a newline possibly preceded by whitespace
  [[:alpha:]]+   # match one or more letters
  [.?!]?         # possibly match a sentence terminator
  \n?            # possibly match a newline
  \s*            # match zero or more whitespaces
/x               # free-spacing regex definition mode
def remove_hyphens(str)
  str.gsub(R) { |s| s.gsub(/[\n\s-]/, '') << "\n" }
end
Examples
str =<<_
The wolverine is now es-
sentially absent from
the south-
ern end of its
European range.
_
puts remove_hyphens(str)
The wolverine is now essentially
absent from
the southern
end of its
European range.
puts remove_hyphens("now es- \nsentially\nabsent")
now essentially
absent
puts remove_hyphens("now es-\nsentially.\nabsent")
now essentially.
absent
remove_hyphens("now es-\nsentially?\n")
#=> "now essentially?\n" (no extra \n at end)

Regex find 'a' or 'an' in sentence in Ruby

I am a beginner in regex. I thought I could complete this without help, but couldn't.
I want to find article-word pairs in the following sentences (where the article must be A or An):
This is a sentence. An egg is a word. A gee another word.
Last line is a word. Ocean is very big.
I used this regex pattern:
/[(An)|(an)|a|A]\s+\w+[\s|.]/
Captured pairs are:
'a sentence.', 'n egg ', 'a word.', 'A gee ', 'a word.', 'n is '.
The above pattern couldn't capture An egg fully. More strangely, it captured 'n is ' from Ocean is.
What could be correct pattern to extract it?
Add a word boundary:
/\b(an?)\s+\w+/i
Edit: (n mustn't be capital)
/\b([aA]n?)\s+\w+/
s = 'This is a sentence. An egg is a word. A gee another word.\nLast line is a word. Ocean is very big.'
s.scan /(?<=\A|\s)[Aa]n?\s+[A-Za-z]+/m
# => [
# [0] "a sentence",
# [1] "An egg",
# [2] "a word",
# [3] "A gee",
# [4] "a word"
# ]
Here we go: /(?<=\A|\s)[Aa]n?\s+[A-Za-z]+/m
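First is a lookbehind that prevents matching “an is” in “Ocean is.” Then we look for “a” (possibly capital), optionally followed by “n”, then whitespace and the word itself. The final m stands for multiline (in Ruby it makes . match newlines, so it has no effect on this particular pattern).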
To avoid using lookbehind, one may change the regexp to:
/\b[Aa]n?\s+[A-Za-z]+/m
UPD: One should avoid using \w here, since \w matches [A-Za-z0-9_]; note especially the underscore.
Try simplifying to \b(An|an|a|A) \w+\b.
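If that pattern is used with scan, keep in mind that scan returns the contents of capture groups when the pattern has any, so the parentheses would yield only the articles. A small sketch with a non-capturing group (the method name article_pairs is made up for illustration):

```ruby
# Return every "a"/"an" + word pair; (?:...) groups without capturing,
# so scan returns whole matches rather than just the article.
def article_pairs(text)
  text.scan(/\b(?:An|an|a|A) \w+\b/)
end

text = "This is a sentence. An egg is a word. A gee another word.\n" \
       "Last line is a word. Ocean is very big."

p article_pairs(text)
# => ["a sentence", "An egg", "a word", "A gee", "a word"]
p text.scan(/\b(An|an|a|A) \w+\b/)
# => [["a"], ["An"], ["a"], ["A"], ["a"]]
```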
I'd use a very simple pattern, along with scan to find all occurrences:
sentence = <<EOT
This is a sentence. An egg is a word. A gee another word.
Last line is a word. Ocean is very big.
EOT
sentence.scan(/\b an? \s+ [a-z]+/imx)
# => ["a sentence", "An egg", "a word", "A gee", "a word"]
I'm using the x flag to improve the readability of the pattern.
The pattern breaks down to:
\b: a word-boundary so only "a" or "an" match. (It's case insensitive.)
an?: matches "a" or "an".
\s+: matches one or more white-spaces.
[a-z]+: matches consecutive runs of letters only. This is significant because any pattern using the \w character-class would also match 0..9 and "_" (underscore). Your sample doesn't contain those, but any text containing those characters would be likely to give you bad results.
The i flag means ignore case. The m flag means to treat the text as a single line: in Ruby it lets . match newlines too, whereas normally . stops at line ends. x means that whitespace in the pattern is not significant, requiring \s to mark where whitespace must match.
If you want the trailing punctuation or space, add . to the end of the pattern:
sentence.scan(/\b an? \s+ [a-z]+ ./imx)
# => ["a sentence.", "An egg ", "a word.", "A gee ", "a word."]
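For reference, a small sketch of what the m flag does in Ruby specifically (the sample string is made up): it only changes ., letting it match a newline, while ^ and $ match at line boundaries with or without it.

```ruby
text = "a word\nnext"

p text[/word.next/]    # => nil           (. does not match "\n" by default)
p text[/word.next/m]   # => "word\nnext"  (with m, . matches "\n" too)
p text.scan(/^\w+/)    # => ["a", "next"] (^ matches at each line start even without m)
```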

Ruby's string: Escape and unescape a custom character

Suppose I declare the £ character dangerous, and I want to be able to protect (escape) it in any string and to unprotect (unescape) it again.
Example 1:
"Foobar £ foobar foobar foobar." # => dangerous string
"Foobar \£ foobar foobar foobar." # => protected string
Example 2:
"Foobar £ foobar £££££££foobar foobar." # => dangerous string
"Foobar \£ foobar \£\£\£\£\£\£\£foobar foobar." # => protected string
Example 3:
"Foobar \£ foobar \\£££££££foobar foobar." # => dangerous string
"Foobar \£ foobar \\\£\£\£\£\£\£\£foobar foobar." # => protected string
Is there an easy way, with Ruby, to escape (and unescape) a given character (such as £ in my example) from a string?
Edit: here is an explanation of the behavior behind this question.
First of all, thanks for your answers. I have a Rails app with a Tweet model having a content field. Example of tweet:
tweet = Tweet.create(content: "Hello #bob")
Inside the model, there's a serialization process that converts the string like this:
dump('Hello #bob') # => '["Hello £", 42]'
# ... where 42 is the id of bob username
Then, I'm able to deserialize and display the tweet like this:
load('["Hello £", 42]') # => 'Hello #bob'
In the same way, it's also possible to do so with more than one username:
dump('Hello #bob and #joe!') # => '["Hello £ and £!", 42, 185]'
load('["Hello £ and £!", 42, 185]') # => 'Hello #bob and #joe!'
That's the goal :)
But this find-and-replace could be hard to perform with something like:
tweet = Tweet.create(content: "£ Hello #bob")
because here we also have to escape the £ char. And I think your solution is good for this. So the result becomes:
dump('£ Hello #bob') # => '["\£ Hello £", 42]'
load('["\£ Hello £", 42]') # => '£ Hello #bob'
Just perfect. <3 <3
Now, if there is this:
tweet = Tweet.create(content: "\£ Hello #bob")
I think we first should escape every \, and then escape every £, like:
dump('\£ Hello #bob') # => '["\\£ Hello £", 42]'
load('["\\£ Hello £", 42]') # => '£ Hello #bob'
However, what can we do in this case:
tweet = Tweet.create(content: "\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\£ Hello #bob")
...where tweet.content.gsub(/(?<!\\)(?=(?:\\\\)*£)/, "\\") does not seem to work.
Hopefully your version of ruby supports lookbehinds. If it doesn't my solution will not work for you.
Escape characters :
str = str.gsub(/(?<!\\)(?=(?:\\\\)*£)/, "\\")
Un-escape characters :
str = str.gsub(/(?<!\\)((?:\\\\)*)\\£/, "\\1£")
Both regexes will work regardless of the amount of backslashes. They are complementing each other.
Escape explanation :
"
(?<! # Assert that it is impossible to match the regex below with the match ending at this position (negative lookbehind)
\\ # Match the character “\” literally
)
(?= # Assert that the regex below can be matched, starting at this position (positive lookahead)
(?: # Match the regular expression below
\\ # Match the character “\” literally
\\ # Match the character “\” literally
)* # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
£ # Match the character “£” literally
)
"
Note that I am matching a position only; no text is consumed at all. Once I have pinpointed the position I want, I insert a \ there.
Explanation of unescape :
"
(?<! # Assert that it is impossible to match the regex below with the match ending at this position (negative lookbehind)
\\ # Match the character “\” literally
)
( # Match the regular expression below and capture its match into backreference number 1
(?: # Match the regular expression below
\\ # Match the character “\” literally
\\ # Match the character “\” literally
)* # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
)
\\ # Match the character “\” literally
£ # Match the character “£” literally
"
Here I capture all the backslashes except the last one, and replace the whole match with those captured backslashes followed by the special character. Tricky stuff :)
If you are using Ruby 1.9, which has lookbehind, then FailedDev's answer should work quite well. If you are using Ruby 1.8, which does not have lookbehind (I think), a different approach may work. Give this a try:
text.gsub!(/(\\.)|£/m) do
  if ($1 != nil) # If escaped anything
    $1           # replace with self.
  else           # Otherwise escape the
    "\\£"        # unescaped £.
  end
end
Note that I am not a Ruby programmer and this snippet is untested (in particular I'm not sure if the: if ($1 != nil) statement usage is correct - it may need to be: if ($1 != "") or if ($1)), but I do know that this general technique (using code in place of a simple replacement string) works. I recently used this same technique for my JavaScript solution to a similar question which was looking to find unescaped asterisks.
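As a different technique from the lookaround patterns above, the idea mentioned in the question (escape every backslash as well as every £) can be sketched without lookbehind. This is only an illustration under that assumption; the names SPECIAL, protect and unprotect are made up. Because every backslash in the protected string then introduces exactly one escaped character, unescaping is a single pass and the round trip is lossless.

```ruby
SPECIAL = '£'

# Prefix every backslash and every SPECIAL character with a backslash.
def protect(str)
  str.gsub(/[\\#{Regexp.escape(SPECIAL)}]/) { |ch| "\\#{ch}" }
end

# Drop the backslash from every backslash-plus-character pair.
def unprotect(str)
  str.gsub(/\\(.)/m) { Regexp.last_match(1) }
end

puts protect('Foobar £ foobar')             # prints: Foobar \£ foobar
puts unprotect(protect('\£ Hello')) == '\£ Hello'  # prints: true
```

Unlike the examples in the question, this scheme also escapes a £ that already has a backslash in front of it, which is what makes the original string recoverable.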
I'm not sure if this is what you want, but I think you can do a simple find-and-replace:
str = str.gsub("£", "\\£") # to escape
str = str.gsub("\\£", "£") # to unescape
Note that I changed \ to \\ because you have to escape the backslash in a double-quoted string.
Edit: I think what you want is a regex that matches an odd number of backslashes:
str = str.gsub(/(^|[^\\])((?:\\\\)*)\\£/, "\\1\\2£")
That does the following transformations
"£" #=> "£"
"\\£" #=> "£"
"\\\\£" #=> "\\\\£"
"\\\\\\£" #=> "\\\\£"

Resources