Regex find 'a' or 'an' in sentence in Ruby - ruby

I am a beginner with regex. I thought I could complete this without help, but I couldn't.
I want to find each article-word pair in the following text (where the article must be A or An):
This is a sentence. An egg is a word. A gee another word.
Last line is a word. Ocean is very big.
I used this regex pattern:
/[(An)|(an)|a|A]\s+\w+[\s|.]/
Captured pairs are:
'a sentence.', 'n egg ', 'a word.', 'A gee ', 'a word.', 'n is '.
The pattern above couldn't capture An egg fully. Even more strangely, it captured 'n is ' from Ocean is.
What would be the correct pattern to extract these pairs?

Add a word boundary:
/\b(an?)\s+\w+/i
Edit: (n mustn't be capital)
/\b([aA]n?)\s+\w+/
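To make the difference concrete, here is a small sketch that runs the question's pattern and a word-boundary pattern side by side on the question's text. The constant and method names (TEXT, broken_pairs, article_pairs) are made up for illustration. The brackets in the original pattern form a character class, so it matches a single character rather than the alternatives "An", "an", "a", "A"; Ruby may also print a duplicated-range warning for it when run with -w.

```ruby
TEXT = "This is a sentence. An egg is a word. A gee another word.\n" \
       "Last line is a word. Ocean is very big."

# Pattern from the question: [...] is a character class, so exactly ONE of the
# characters "(", "A", "n", ")", "|", "a" is matched before the whitespace.
def broken_pairs(text)
  text.scan(/[(An)|(an)|a|A]\s+\w+[\s|.]/)
end

# Word-boundary version: "a" or "A" must start a word, optionally followed by "n".
def article_pairs(text)
  text.scan(/\b[aA]n?\s+\w+/)
end

p broken_pairs(TEXT)
# => ["a sentence.", "n egg ", "a word.", "A gee ", "a word.", "n is "]
p article_pairs(TEXT)
# => ["a sentence", "An egg", "a word", "A gee", "a word"]
```

The group from the answer's pattern is dropped here because String#scan returns only the captured groups when the pattern contains any.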

s = 'This is a sentence. An egg is a word. A gee another word.\nLast line is a word. Ocean is very big.'
s.scan /(?<=\A|\s)[Aa]n?\s+[A-Za-z]+/m
# => [
# [0] "a sentence",
# [1] "An egg",
# [2] "a word",
# [3] "A gee",
# [4] "a word"
# ]
Here we go: /(?<=\A|\s)[Aa]n?\s+[A-Za-z]+/m
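First comes a lookbehind that prevents a match on the “n is” inside “Ocean is.” Then we look for an A (upper or lower case), optionally followed by “n”, then whitespace and the word itself. The trailing m is Ruby's multiline flag, which makes . match newlines; this pattern contains no ., so the flag has no effect here.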
To avoid using lookbehind, one may change the regexp to:
/\b[Aa]n?\s+[A-Za-z]+/m
UPD: One should avoid using \w here, since \w matches [A-Za-z0-9_]. Note especially the digits and the underscore.
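A quick sketch of that difference, using a made-up sample string (not from the question) that contains digits and an underscore. The method names are illustrative only.

```ruby
# \w accepts letters, digits and "_" as the "word" after the article.
def pairs_with_w(text)
  text.scan(/\b[Aa]n?\s+\w+/)
end

# [A-Za-z] accepts ASCII letters only.
def pairs_letters_only(text)
  text.scan(/\b[Aa]n?\s+[A-Za-z]+/)
end

sample = "a 42 and an _x and an apple"
p pairs_with_w(sample)        # => ["a 42", "an _x", "an apple"]
p pairs_letters_only(sample)  # => ["an apple"]
```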

Try simplifying to \b(An|an|a|A) \w+\b.

I'd use a very simple pattern, along with scan to find all occurrences:
sentence = <<EOT
This is a sentence. An egg is a word. A gee another word.
Last line is a word. Ocean is very big.
EOT
sentence.scan(/\b an? \s+ [a-z]+/imx)
# => ["a sentence", "An egg", "a word", "A gee", "a word"]
I'm using the x flag to improve the readability of the pattern.
The pattern breaks down to:
\b: a word-boundary so only "a" or "an" match. (It's case insensitive.)
an?: matches "a" or "an".
\s+: matches one or more white-spaces.
[a-z]+: matches consecutive runs of letters only. This is significant because any pattern using the \w character-class would also match 0..9 and "_" (underscore). Your sample doesn't contain those, but any text containing those characters would be likely to give you bad results.
The i flag means ignore case. The m flag makes . match newlines as well, so the text is treated as a single line of text; normally line-ends are significant to the dot. x means that white-space in the pattern is not significant, requiring \s to mark where it should match.
If you want the trailing punctuation or space, add . to the end of the pattern:
sentence.scan(/\b an? \s+ [a-z]+ ./imx)
# => ["a sentence.", "An egg ", "a word.", "A gee ", "a word."]

Related

How do I apply gsub subject to a function?

I'm using Rails 5 and Ruby 2.4. I have a function
my_function(str1, str2)
that will return true or false given two string arguments. What I would like to do is given a larger string, for instance
"a b c d"
I would like to replace two consecutive "words" (a word by my definition is a sequence of characters followed by a word boundary) with the empty string if the expression
my_function(str1, str2)
evaluates to true for those two consecutive words. So for instance, if
my_function("b", "c")
evaluates to true, I would like the above string to become
"a d"
How do I do this?
Edit: I'm including the output based on Tom Lord's answer ...
If I use
def stuff(line)
matches = line.scan(/\b((\S+?)\b.*?\b(\S+?))\b/)
matches.each do |full_match, word1, word2|
line.delete!(full_match) if word1.eql?("hello") && word2.eql?("world")
end
end
and line is
"hello world this is a test"
the resulting string line is
"tisisatst"
This is not quite what I expected. The result should be
" this is a test"
Edit: This is an updated answer, based on the comments below. I have left my original answer at the bottom.
Scanning a string for "two consecutive words" is a bit tricky. Your best option is probably to use the \b anchor in a regex, which signifies a "word boundary":
string_to_change = "a b c d"
matches = string_to_change.scan(/\b((\S+?)\b.*?\b(\S+?))\b/)
# => [["a b", "a", "b"], ["c d", "c", "d"]]
...Where the first string is the "full match" (including any whitespace or punctuation), the others are the two words.
To break down that regex:
\b means "word boundary". I have placed one on each side of both words. This solution assumes that str1 and str2 are both a single word. (If they contain spaces, then I don't know what behaviour you expect?)
\S+? means "one or more non-whitespace characters". (Matching non-greedily, so it will stop matching at the first word boundary).
You can then remove each "full match" from the string, if the method returns true for the two words:
matches.each do |full_match, word1, word2|
string_to_change.gsub!(full_match, '') if my_function(word1, word2)
end
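A small sketch of why the loop above uses gsub! rather than String#delete!: delete treats its argument as a set of characters and removes every occurrence of each one, which would explain an output such as "tisisatst" for the line in the question. The variable name line is taken from the question; the rest is illustrative.

```ruby
line = "hello world this is a test"

# String#delete removes every character that appears in the argument
# (here: h, e, l, o, space, w, r, d), wherever it occurs in the string.
p line.delete("hello world")   # => "tisisatst"

# String#sub removes the first occurrence of the substring itself.
p line.sub("hello world", "")  # => " this is a test"
```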
One thing that's not accounted for here (you didn't specify this well in your question...) was how to handle strings containing three or more words. For example, consider the following:
"hello world this is a test"
Suppose my_function(word1, word2) returns true only for the pairs: "world", "this" and "hello", "is".
My code above will only look at the pairs ("hello", "world"), ("this", "is") and ("a", "test"). But perhaps it should actually:
Look at all pairs of words, i.e. match all words with the left- and right- hand side.
Delete pairs of words repeatedly, i.e. after the initial pair: "world this" is removed, the string should be re-scanned and then "hello is" should also be removed?
If such further enhancements are needed, then please explain them clearly in a new question (if you are struggling to solve the problem yourself).
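For what it's worth, here is one possible sketch of the first enhancement (checking every adjacent pair from left to right). It is not the approach from the answer above: it assumes words are separated by whitespace, and it rejoins the result with single spaces, so the original spacing is not preserved. The method name remove_pairs is hypothetical, and the block stands in for my_function.

```ruby
# Walk the words left to right; when the block returns true for a word and
# its right-hand neighbour, drop both and continue after the pair.
def remove_pairs(text)
  words = text.split
  kept = []
  i = 0
  while i < words.length
    if i + 1 < words.length && yield(words[i], words[i + 1])
      i += 2            # skip both words of the matching pair
    else
      kept << words[i]  # keep this word and move on by one
      i += 1
    end
  end
  kept.join(" ")
end

# Stand-in for my_function("b", "c") returning true:
p remove_pairs("a b c d") { |w1, w2| w1 == "b" && w2 == "c" }  # => "a d"
```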
Original answer:
str1 = "b"
str2 = "c"
string_to_change = "a b c d"
if my_function(str1, str2)
string_to_change.gsub!(/\b#{str1}\b\s+\b#{str2}\b/, "")
end
To break down that regex:
\b means "word boundary". I have placed one on each side of both strings. This solution assumes that str1 and str2 are both a single word. (If they contain spaces, then I don't know what behaviour you expect?)
\s+ means "one or more whitespace characters". You may wish to tweak this to allow other punctuation too, such as a comma or full stop. A fully generic solution to this issue could in fact be:
string_to_change.gsub!(/\b#{str1}\b.(\B.)*#{str2}\b/, "")
# Or equivalently:
string_to_change.gsub!(/\b#{str1}\b(.\B)*.#{str2}\b/, "")
.(\B.)* is instead collecting each character, one at a time, always checking that it's not the first letter of a word (i.e. is preceded by a non-word boundary).

Ruby regex eliminate new line until . or ? or capital letter

I'd like to do the following with my strings:
line1= "You have a house\nnext to the corner."
Eliminate \n if the sentence doesn't finish in new line after dot or question mark or capital letter, so the desired output will be in this case:
"You have a house next to the corner.\n"
Another example, this time with the question mark:
"You like baggy trousers,\ndon't you?"
should become:
"You like baggy trousers, don't you?\n".
I've tried:
line1.gsub!(/(?<!?|.)"\n"/, " ")
(?<!?|.) is meant to say: immediately preceding \n there must NOT be either a question mark (?) or a period (.)
But I get the following syntax error:
SyntaxError: (eval):2: target of repeat operator is not specified: /(?<!?|.)"\n"/
And for the sentences where in the middle of them there's a capital letter, insert a \n before that capital letter so the sentence:
"We were winning The Home Secretary played a important role."
Should become:
"We were winning\nThe Home Secretary played a important role."
NOTE: The answer is not meant to provide a generic way to remove unnecessary newline symbols inside sentences, it is only meant to serve OP purpose to only remove or insert newlines in specific places in a string.
Since you need to handle matches differently in different scenarios, you should consider a 2-step approach. The first step is
.gsub(/(?<![?.])\n/, ' ')
This one will replace all newlines that are not preceded with ? and . (as (?<![?.]) is a negative lookbehind failing the match if there is a subpattern match before the current location inside the string).
The second step is
.sub(/(?<!^) *+(?=[A-Z])/, "\n")
or
.sub(/(?<!^) *+(?=\p{Lu})/, "\n")
It will match 0+ spaces ( *+) (possessively, no backtracking into the space pattern) that are not at the beginning of the line (due to the (?<!^) negative lookbehind, replace ^ with \A to match the start of the whole string), and that is followed with a capital letter ((?=\p{Lu}) is a positive lookahead that requires a pattern to appear right after the current location to the right).
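Put together, the two steps might look like the sketch below. The method name fix_newlines is hypothetical. Note that the replacement is written with double quotes: in Ruby a single-quoted '\n' is a backslash followed by n, not a newline. Because the second step uses sub, only the first qualifying capital letter gets a newline in front of it.

```ruby
def fix_newlines(text)
  # Step 1: join lines, except where the newline follows "?" or ".".
  joined = text.gsub(/(?<![?.])\n/, " ")
  # Step 2: put a newline before the first capital letter that is not at the
  # start of a line, consuming any spaces in front of it.
  joined.sub(/(?<!^) *+(?=[A-Z])/, "\n")
end

p fix_newlines("You have a house\nnext to the corner.")
# => "You have a house next to the corner."
p fix_newlines("We were winning The Home Secretary played a important role.")
# => "We were winning\nThe Home Secretary played a important role."
```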
You are nearly there. You need to a) escape both ? and . and b) remove quotation marks around \n in the expression:
line1= "You have a house\nnext to the corner.\nYes?\nNo."
line1.gsub!(/(?<!\?|\.)\s*\n\s*/, " ")
#⇒ "You have a house next to the corner.\nYes?\nNo."
As you want the trailing \n, just add it afterwards:
line1.gsub! /\Z/, "\n"
#⇒ "You have a house next to the corner.\nYes?\nNo.\n"
The simple way to do this is to replace all the embedded new-lines with a space, which effectively joins the line segments, then fix the line-end. It's not necessary to worry about the punctuation and it's not necessary to use (or maintain) a regex.
You can do this a lot of ways, but I'd use:
sentences = [
"foo\nbar",
"foo\n\nbar",
"foo\nbar\n",
]
sentences.map{ |s| s.gsub("\n", ' ').squeeze(' ').strip + "\n" }
# => ["foo bar\n", "foo bar\n", "foo bar\n"]
Here's what's happening inside the map block:
s # => "foo\nbar", "foo\n\nbar", "foo\nbar\n"
.gsub("\n", ' ') # => "foo bar", "foo  bar", "foo bar "
.squeeze(' ') # => "foo bar", "foo bar", "foo bar "
.strip # => "foo bar", "foo bar", "foo bar"
+ "\n"

Ruby Regular expression not matching properly

I am trying to create a regex to find words that contain any vowel.
So far I have tried this:
/(.*?\S[aeiou].*?[\s|\.])/i
but I have not used regex much, so it's not working properly.
For example, if I input "test is 1234 and sky fly test1234"
it should match test, is, and, test1234, but it shows
test, is,1234 and
If I put in something else, I get different output.
Alternatively you can also do something like:
"test is 1234 and sky fly test1234".split.find_all { |a| a =~ /[aeiou]/ }
# => ["test", "is", "and", "test1234"]
You could use the below regex.
\S*[aeiou]\S*
\S* matches zero or more non-space characters.
or
\w*[aeiou]\w*
It will solve:
\b\w*[aeiou]+\w*\b
https://www.debuggex.com/r/O-fU394iC5ErcSs7
or you can substitute \w by \S
\b\S*[aeiou]+\S*\b
https://www.debuggex.com/r/RNE6Y6q1q5yPJbe-
\b - a word boundary
\w - same as [_a-zA-Z0-9]
\S - a non-whitespace character
Try this:
\b\w*[aeiou]\w*\b
\b denotes a word boundary, so this regexp matches a word boundary, zero or more word characters, a vowel, zero or more word characters and another word boundary.
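As a quick check, this pattern can be run with String#scan on the input from the question. The method name words_with_vowels is made up for this sketch, and the i flag is added so that upper-case vowels count too (an assumption based on the question's own pattern).

```ruby
# Return every run of word characters that contains at least one vowel.
def words_with_vowels(text)
  text.scan(/\b\w*[aeiou]\w*\b/i)
end

p words_with_vowels("test is 1234 and sky fly test1234")
# => ["test", "is", "and", "test1234"]
```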

How the Anchor \z and \G works in Ruby?

I am using Ruby 1.9.3 and am a newbie to this platform.
From the docs I just became familiar with two anchors, \z and \G. I played a little with \z to see how it works, because its definition ("End of String") confused me: I can't understand what is meant by "End". So I tried the small snippets below, but I still don't get it.
CODE
irb(main):011:0> str = "Hit him on the head me 2\n" + "Hit him on the head wit>
=> "Hit him on the head me 2\nHit him on the head with a 24\n"
irb(main):012:0> str =~ /\d\z/
=> nil
irb(main):013:0> str = "Hit him on the head me 24 2\n" + "Hit him on the head >
=> "Hit him on the head me 24 2\nHit him on the head with a 24\n"
irb(main):014:0> str =~ /\d\z/
=> nil
irb(main):018:0> str = "Hit1 him on the head me 24 2\n" + "Hit him on the head>
=> "Hit1 him on the head me 24 2\nHit him on the head with a11 11 24\n"
irb(main):019:0> str =~ /\d\z/
=> nil
irb(main):020:0>
Every time I got nil as the output. So how is \z evaluated? What does "End" mean? I think I have misunderstood the word "End" in the docs. Could anyone help me understand why this is happening?
Also, I didn't find any example for the anchor \G. Could anyone give an example that shows how \G is used in real-world programming?
EDIT
irb(main):029:0>
irb(main):030:0* ("{123}{45}{6789}").scan(/\G(?!^)\{\d+\}/)
=> []
irb(main):031:0> ('{123}{45}{6789}').scan(/\G(?!^)\{\d+\}/)
=> []
irb(main):032:0>
Thanks
\z matches the end of the input. You are trying to find a match where a digit (the 4) occurs at the end of the input. The problem is that there is a newline at the end of the input, so you don't find a match. \Z matches either at the end of the input or just before a newline at the end of the input.
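This can be checked directly with a shortened string (an illustrative stand-in for the longer strings in the question). =~ returns the index of the match, or nil when there is none.

```ruby
p "a 24\n" =~ /\d\z/        # => nil  (the last character is "\n", not a digit)
p "a 24\n" =~ /\d\Z/        # => 3    (\Z also matches just before a final newline)
p "a 24\n".chomp =~ /\d\z/  # => 3    (after chomp the string really ends in "4")
```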
So:
/\d\z/
matches the "4" in:
"24"
and:
/\d\Z/
matches the "4" in the above example and the "4" in:
"24\n"
Check out this question for example of using \G:
Examples of regex matcher \G (The end of the previous match) in Java would be nice
UPDATE: Real-World uses for \G
I came up with a more real world example. Say you have a list of words that are separated by arbitrary characters that cannot be well predicted (or there's too many possibilities to list). You'd like to match these words where each word is its own match up until a particular word, after which you don't want to match any more words. For example:
foo,bar.baz:buz'fuzz*hoo-har/haz|fil^bil!bak
You want to match each word until 'har'. You don't want to match 'har' or any of the words that follow. You can do this relatively easily using the following pattern:
/(?<=^|\G\W)\w+\b(?<!har)/
The first attempt will match the beginning of the input followed by zero non-word character followed by 3 word characters ('foo') followed by a word boundary. Finally, a negative lookbehind assures that the word which has just been matched is not 'har'.
On the second attempt, matching picks back up at the end of the last match. 1 non-word character is matched (',' - though it is not captured due to the lookbehind, which is a zero-width assertion), followed by 3 characters ('bar').
This continues until 'har' is matched, at which point the negative lookbehind is triggered and the match fails. Because all matches are supposed to be "attached" to the last successful match, no additional words will be matched.
The result is:
foo
bar
baz
buz
fuzz
hoo
If you want to reverse it and have all words after 'har' (but, again, not including 'har'), you can use an expression like this:
/(?!^)(?<=har\W|\G\W)\w+\b/
This will match either a word which is immediately preceded by 'har' or the end of the last match (except we have to make sure not to match the beginning of the input). The list of matches is:
haz
fil
bil
bak
If you do want to match 'har' and all following words, you could use this:
/\bhar\b|(?!^)(?<=\G\W)\w+\b/
This produces the following matches:
har
haz
fil
bil
bak
Sounds like you want to know how Regex works? Or do you want to know how Regex works with ruby?
Check these out.
Regexp Class description
The Regex Coach - Great for testing regex matching
Regex cheat sheet
I understand \G to be a boundary match character. So it would tell the next match to start at the end of the last match. Perhaps since you haven't made a match yet, you can't have a second.
Here is the best example I can find. It's not in Ruby, but the concept should be the same.
I take it back, this might be more useful
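A short Ruby sketch of \G may help here. In String#gsub and String#scan, \G matches at the position where each search attempt starts: the beginning of the string at first, then the end of the previous match. The first two lines follow the example in Ruby's Regexp documentation; the brace strings are adapted from the question.

```ruby
p "    a b c".gsub(/ /, "_")    # => "____a_b_c"
p "    a b c".gsub(/\G /, "_")  # => "____a b c"  (only the leading, contiguous spaces)

# Each match must begin exactly where the previous one ended, so scanning
# stops at the first gap and the trailing "{1}" is never reached.
p "{123}{45}{6789} x {1}".scan(/\G\{\d+\}/)
# => ["{123}", "{45}", "{6789}"]

# The pattern from the question: on the first attempt \G is at position 0,
# where ^ also matches, so (?!^) rejects it and nothing is ever matched.
p "{123}{45}{6789}".scan(/\G(?!^)\{\d+\}/)  # => []
```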

Ruby regex extracting words

I'm currently struggling to come up with a regex that can split up a string into words where words are defined as a sequence of characters surrounded by whitespace, or enclosed between double quotes. I'm using String#scan
For instance, the string:
' hello "my name" is "Tom"'
should match the words:
hello
my name
is
Tom
I managed to match the words enclosed in double quotes by using:
/"([^\"]*)"/
but I can't figure out how to incorporate the surrounded by whitespace characters to get 'hello', 'is', and 'Tom' while at the same time not screw up 'my name'.
Any help with this would be appreciated!
result = ' hello "my name" is "Tom"'.split(/\s+(?=(?:[^"]*"[^"]*")*[^"]*$)/)
will work for you. It will print
=> ["", "hello", "\"my name\"", "is", "\"Tom\""]
Just ignore the empty strings.
Explanation
"
\\s # Match a single character that is a “whitespace character” (spaces, tabs, and line breaks)
+ # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
(?= # Assert that the regex below can be matched, starting at this position (positive lookahead)
(?: # Match the regular expression below
[^\"] # Match any character that is NOT a “\"”
* # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
\" # Match the character “\"” literally
[^\"] # Match any character that is NOT a “\"”
* # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
\" # Match the character “\"” literally
)* # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
[^\"] # Match any character that is NOT a “\"”
* # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
\$ # Assert position at the end of a line (at the end of the string or before a line break character)
)
"
You can use reject like this to avoid empty strings
result = ' hello "my name" is "Tom"'
.split(/\s+(?=(?:[^"]*"[^"]*")*[^"]*$)/).reject {|s| s.empty?}
prints
=> ["hello", "\"my name\"", "is", "\"Tom\""]
text = ' hello "my name" is "Tom"'
text.scan(/\s*("([^"]+)"|\w+)\s*/).each {|match| puts match[1] || match[0]}
Produces:
hello
my name
is
Tom
Explanation:
0 or more spaces followed by
either
some words within double-quotes OR
a single word
followed by 0 or more spaces
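If an array is more useful than printed lines, the same scan can be mapped instead of printed. This is a minor variation on the snippet above; the method name extract_words is made up for the sketch. Each scan result is a pair: the whole token, and the text inside the quotes (nil for unquoted words).

```ruby
def extract_words(text)
  text.scan(/\s*("([^"]+)"|\w+)\s*/).map do |token, quoted|
    quoted || token  # prefer the text inside the quotes when there is any
  end
end

p extract_words(' hello "my name" is "Tom"')
# => ["hello", "my name", "is", "Tom"]
```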
You can try this regex:
/\b(\w+)\b/
which uses \b to find the word boundary. And this web site http://rubular.com/ is helpful.
