How do I apply gsub subject to a function? - ruby

I'm using Rails 5 and Ruby 2.4. I have a function
my_function(str1, str2)
that will return true or false given two string arguments. What I would like to do is given a larger string, for instance
"a b c d"
I would like to replace two consecutive "words" (a word by my definition is a sequence of characters followed by a word boundary) with the empty string if the expression
my_function(str1, str2)
evaluates to true for those two consecutive words. So for instance, if
my_function("b", "c")
evaluates to true, I would like the above string to become
"a d"
How do I do this?
Edit: I'm including the output based on Tom Lord's answer ...
If I use
def stuff(line)
matches = line.scan(/\b((\S+?)\b.*?\b(\S+?))\b/)
matches.each do |full_match, word1, word2|
line.delete!(full_match) if word1.eql?("hello") && word2.eql?("world")
end
end
and line is
"hello world this is a test"
the resulting string line is
"tisisatst"
This is not quite what I expected. The result should be
" this is a test"

Edit: This is an updated answer, based on the comments below. I have left my original answer at the bottom.
Scanning a string for "two consecutive words" is a bit tricky. Your best option is probably to use the \b anchor in a regex, which signifies a "word boundary":
string_to_change = "a b c d"
matches = string_to_change.scan(/\b((\S+?)\b.*?\b(\S+?))\b/)
# => [["a b", "a", "b"], ["c d", "c", "d"]]
...Where the first string is the "full match" (including any whitespace or punctuation), the others are the two words.
To break down that regex:
\b means "word boundary". I have placed one on each side of both words. This solution assumes that each word is a single token. (If they contain spaces, then I don't know what behaviour you expect?)
\S+? means "one or more non-whitespace characters". (It matches non-greedily, so it stops at the first word boundary.)
You can then remove each "full match" from the string, if the method returns true for the two words:
matches.each do |full_match, word1, word2|
string_to_change.gsub!(full_match, '') if my_function(word1, word2)
end
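Note that gsub! with a String pattern replaces that literal substring, whereas String#delete! treats its argument as a set of characters and removes every occurrence of each one. That difference is most likely what produced "tisisatst" in the question. A minimal sketch, using the sample line from the question (the variable names deleted and substituted are only for illustration):

```ruby
line = "hello world this is a test"

# String#delete removes every character that appears in its argument
# (here: h, e, l, o, space, w, r, d), not the substring itself.
deleted = line.delete("hello world")

# String#sub with a String pattern removes the first literal occurrence.
substituted = line.sub("hello world", "")

puts deleted      # tisisatst
puts substituted  # " this is a test" (note the leading space)
```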
One thing that's not accounted for here (you didn't specify this well in your question...) was how to handle strings containing three or more words. For example, consider the following:
"hello world this is a test"
Suppose my_function(word1, word2) returns true only for the pairs ("world", "this") and ("hello", "is").
My code above will only look at the pairs ("hello", "world"), ("this", "is") and ("a", "test"). But perhaps it should actually:
Look at all pairs of words, i.e. match all words with the left- and right- hand side.
Delete pairs of words repeatedly, i.e. after the initial pair: "world this" is removed, the string should be re-scanned and then "hello is" should also be removed?
If such further enhancements are needed, then please explain them clearly in a new question (if you are struggling to solve the problem yourself).
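As a rough sketch of the first enhancement (checking every adjacent pair, not just the non-overlapping ones), one option is to work on an array of words instead of a regex. This assumes words are separated by whitespace and that the result may be re-joined with single spaces, which is a simplification of the "word boundary" definition in the question. The method name remove_matching_pairs and the body of my_function are hypothetical stand-ins:

```ruby
# Hypothetical stand-in for the asker's predicate.
def my_function(str1, str2)
  [str1, str2] == ["b", "c"]
end

# Walks the words left to right in a single pass. When my_function is
# true for a word and its right-hand neighbour, both are dropped.
def remove_matching_pairs(line)
  words = line.split
  kept = []
  i = 0
  while i < words.length
    if i + 1 < words.length && my_function(words[i], words[i + 1])
      i += 2 # skip both words of the matching pair
    else
      kept << words[i]
      i += 1
    end
  end
  kept.join(" ")
end

puts remove_matching_pairs("a b c d") # a d
```

This does not re-scan after a removal, so the second enhancement (repeated deletion) would need an outer loop that runs until the string stops changing.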
Original answer:
str1 = "b"
str2 = "c"
string_to_change = "a b c d"
if my_function(str1, str2)
string_to_change.gsub!(/\b#{str1}\b\s+\b#{str2}\b/, "")
end
To break down that regex:
\b means "word boundary". I have placed one on each side of both strings. This solution assumes that str1 and str2 are both a single word. (If they contain spaces, then I don't know what behaviour you expect?)
\s+ means "one or more whitespace characters". You may wish to tweak this to allow other punctuation too, such as a comma or full stop. A fully generic solution to this issue could in fact be:
string_to_change.gsub!(/\b#{str1}\b.(\B.)*#{str2}\b/, "")
# Or equivalently:
string_to_change.gsub!(/\b#{str1}\b(.\B)*.#{str2}\b/, "")
.(\B.)* instead collects each character, one at a time, always checking that it's not the first letter of a word (i.e. that it is preceded by a non-word boundary).
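One caveat with interpolating str1 and str2 directly: any regex metacharacters they contain are interpreted as part of the pattern. A minimal sketch of the difference, assuming that is not what you want and using Regexp.escape (the sample values here are made up for illustration):

```ruby
str1 = "a.b"
str2 = "c"
text = "a.b c axb c"

# Unescaped: the "." in str1 matches any character, so "axb c" is removed too.
unescaped = text.gsub(/\b#{str1}\b\s+\b#{str2}\b/, "")

# Escaped: only the literal "a.b c" is removed.
escaped = text.gsub(/\b#{Regexp.escape(str1)}\b\s+\b#{Regexp.escape(str2)}\b/, "")

p unescaped # " "
p escaped   # " axb c"
```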

Related

Use regular expression to fetch 3 groups from string

This is my expected result.
Input a string and get three returned string.
I have no idea how to finish it with Regex in Ruby.
this is my roughly idea.
match(/(.*?)(_)(.*?)(\d+)/)
Input and expected output
# "R224_OO2003" => R224, OO, 2003
# "R2241_OOP2003" => R2241, OOP, 2003
If the example description I gave in my comment on the question is correct, you need a very straightforward regex:
r = /(.+)_(.+)(\d{4})/
Then:
"R224_OO2003".scan(r).flatten #=> ["R224", "OO", "2003"]
"R2241_OOP2003".scan(r).flatten #=> ["R2241", "OOP", "2003"]
Assuming that your three parts consist of (R and one or more digits), then an underbar, then (one or more non-whitespace characters), before finally (a 4-digit numeric date), then your regex could be something like this:
^(R\d+)_(\S+)(\d{4})$
The ^ indicates start of string, and the $ indicates end of string. \d+ indicates one or more digits, while \S+ says one or more non-whitespace characters. The \d{4} says exactly four digits.
To recover data from the matches, you could either use the pre-defined globals that line up with your groups, or you could use named captures.
To use the match globals just use $1, $2, and $3. In general, you can figure out the number to use by counting the left parentheses of the specific group.
To use the named captures, include ?<name> right after the left paren of a particular group. For example:
x = "R2241_OOP2003"
match_data = /^(?<first>R\d+)_(?<second>\S+)(?<third>\d{4})$/.match(x)
puts match_data['first'], match_data['second'], match_data['third']
yields
R2241
OOP
2003
as expected.
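For comparison, here is a small sketch of the numbered-globals route, plus MatchData#captures as a way to get all groups at once. The variable names first, second, third and parts are only for illustration:

```ruby
x = "R2241_OOP2003"
pattern = /^(R\d+)_(\S+)(\d{4})$/

# =~ sets $1, $2, $3 to the numbered groups when the match succeeds.
if x =~ pattern
  first, second, third = $1, $2, $3
end

# MatchData#captures returns the same groups as an array.
parts = x.match(pattern).captures

p [first, second, third] # ["R2241", "OOP", "2003"]
p parts                  # ["R2241", "OOP", "2003"]
```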
As long as your pattern covers all possibilities, then you just need to use the match object to return the 3 strings:
my_match = "R224_OO2003".match(/(.*?)(_)(.*?)(\d+)/)
#=> #<MatchData "R224_OO2003" 1:"R224" 2:"_" 3:"OO" 4:"2003">
puts my_match[0] #=> "R224_OO2003"
puts my_match[1] #=> "R224"
puts my_match[2] #=> "_"
puts my_match[3] #=> "OO"
puts my_match[4] #=> "2003"
A MatchData object holds each capture group starting at index [1]. As you can see, index [0] returns the entire match. If you don't want to capture the "_", you can leave its parentheses out.
Also, I'm not sure you are getting what you want with the part:
(.*?)
This says zero or more of any character, matched lazily (as few characters as possible), so it can also match an empty string.
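To see what the lazy quantifier changes in practice, here is a small sketch that drops the parentheses around "_" and compares the lazy pattern with a greedy variant. The greedy pattern is only for contrast and is not from the question:

```ruby
input = "R224_OO2003"

# Lazy: each (.*?) takes as little as possible, so (\d+) gets all four digits.
lazy = input.match(/(.*?)_(.*?)(\d+)/).captures

# Greedy: the second (.*) takes as much as possible and leaves (\d+)
# only the single digit it needs.
greedy = input.match(/(.*)_(.*)(\d+)/).captures

p lazy   # ["R224", "OO", "2003"]
p greedy # ["R224", "OO200", "3"]
```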

regex scan only returning first value

I have two strings that should both return matches according to the regex, but only str1 returns the expected match. str1 is an exact match for the regex (created by Avinash Raj) below. str2 contains str1 and more data. I expected str2 to return str1 and more values that matched, but it returns nothing. Can someone explain why?
str1="3,15,14,31,40,5,5,4,5,3,4,4,5,2,2,2,1,2,1,1,3,3,3,2,4,3,false,false,false,false,false,true,false,true,false,false,false,false,false,false,false,true,false,false,false,false,false,false,false,false,false,false,false,3,3,3,2,3"
str2="3,15,14,31,40,5,5,4,5,3,4,4,5,2,2,2,1,2,1,1,3,3,3,2,4,3,false,false,false,false,false,true,false,true,false,false,false,false,false,false,false,true,false,false,false,false,false,false,false,false,false,false,false,3,3,3,2,3,3,15,14,35,27,4,5,3,5,3,2,4,4,2,1,1,2,2,2,1,3,3,3,2,5,9,true,false,false,false,false,false,false,false,false,true,false,false,false,false,false,false,false,false,false,false,false,true,true,false,false,false,false,2,2,3,2,3,3,15,16,34,53,4,4,4,3,1,3,4,3,1,1,1,1,1,1,1,2,3,2,3,5,1,true,false,false,false,false,false,true,false,false,false,false,false,false,false,true,true,false,false,false,false,false,false,false,false,false,false,false,3,2,3,2,3,3,15,18,37,29,4,4,4,3,2,3,3,4,1,1,1,1,1,1,1,1,3,1,2,4,1,true,false,false,false,false,false,false,false,true,false,false,false,false,false,false,false,false,false,false,false,false,false,true,false,false,false,false,3,2,3,2,3,3,15,20,34,37,4,4,4,3,1,3,3,4,1,1,1,1,1,1,1,1,1,1,2,4,1,false,false,false,true,false,false,false,false,false,false,false,false,false,true,false,true,false,false,false,false,false,false,false,false,true,false,false,3,1,3,1,3,3,16,10,18,30,4,3,3,3,1,3,3,3,1,1,1,1,1,1,1,1,2,1,4,4,3,false,false,false,false,false,true,false,true,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,true,false,false,3,2,3,2,3,3,16,12,39,5,5,5,4,5,3,5,5,5,1,1,1,1,1,1,1,2,1,1,1,5,10,false,false,false,true,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,true,false,3,2,3,2,3,3,16,14,18,27,4,4,4,4,2,3,3,4,1,1,1,1,1,1,1,1,1,1,2,5,1,true,false,false,false,false,false,false,false,false,false,false,false,false,true,false,true,false,false,false,false,true,false,false,false,false,false,false,3,2,3,2,3,3,16,16,18,32,5,5,5,5,4,5,5,5,1,1,1,1,1,1,1,2,1,1,1,5,3,false,false,false,false,false,true,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,true
,false,true,false,3,2,3,2,3,3,16,18,20,7,5,5,5,5,3,3,3,4,1,1,1,1,1,1,1,1,1,1,2,5,1,false,false,false,true,false,false,false,false,false,false,false,false,false,true,false,false,false,false,false,false,true,false,false,false,false,false,false,3,2,3,2,3,3,16,20,18,59,4,4,4,3,1,1,1,2,1,1,1,1,1,1,1,1,2,2,4,5,9,false,false,false,true,false,false,false,true,false,false,false,false,false,false,false,false,false,false,false,false,false,false,true,false,false,false,false,3,2,3,2,3,3,17,10,16,9,3,3,3,3,1,2,3,3,1,1,1,1,1,1,1,1,2,1,3,5,1,true,false,false,false,false,false,true,false,false,false,false,false,false,false,false,false,false,false,false,false,true,false,false,false,false,false,false,3,2,3,2,3,3,17,12,16,17,4,3,4,2,1,4,3,2,1,1,1,1,1,1,1,1,1,1,4,5,3,false,false,false,false,false,true,false,false,false,false,false,false,false,false,false,true,false,false,false,false,false,false,false,false,false,false,false,3,2,3,2,3,3,17,14,16,21,4,4,4,4,1,3,4,4,1,1,1,1,1,1,1,1,1,1,2,5,1,false,false,false,true,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,true,true,false,false,3,2,3,2,3,3,17,16,16,20,5,5,4,5,3,4,4,5,1,1,1,1,1,1,1,1,1,1,1,5,8,false,false,false,true,false,false,false,false,false,false,false,false,false,false,false,true,false,false,false,false,false,false,false,true,false,false,false,3,2,3,2,3,3,17,18,16,31,4,4,4,4,1,4,3,3,1,1,1,1,1,1,1,1,1,1,3,5,1,true,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,true,false,true,false,true,false,false,false,false,3,2,3,2,3,3,17,20,18,8,5,5,4,5,4,4,4,5,1,1,1,1,1,1,1,2,1,1,1,5,1,false,false,false,true,false,false,false,false,false,false,false,false,false,true,false,false,false,false,false,false,false,false,false,true,false,false,false,3,2,3,2,3,3,18,10,31,33,3,2,3,2,2,2,2,3,1,1,1,1,1,1,1,1,1,1,1,5,7,true,false,false,false,false,false,true,false,false,false,false,false,false,false,false,false,false,false,false,false,f
alse,false,false,false,false,false,false,3,2,3,2,3,3,18,12,36,11,4,4,4,5,3,4,3,3,1,1,2,1,2,1,2,2,1,1,1,5,1,false,false,false,true,false,false,true,false,false,false,false,false,false,true,false,true,false,false,false,false,false,false,false,false,true,false,false,3,2,3,2,3,3,18,14,49,6,3,3,2,2,1,2,2,2,2,1,1,1,2,1,2,3,3,4,4,5,9,true,false,false,false,false,false,false,false,false,true,false,false,false,false,false,false,false,false,false,false,false,false,true,false,false,false,false,3,2,3,2,3,3,18,16,32,53,3,4,4,3,3,3,3,3,1,1,1,1,1,1,2,2,1,1,3,5,7,false,false,true,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,true,false,false,true,false,false,false,false,3,2,3,2,3,3,18,18,37,59,5,4,4,4,4,4,4,4,1,1,1,1,1,1,1,2,1,1,2,5,7,false,false,true,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,true,false,false,true,false,false,false,false,3,2,3,2,3,3,19,10,5,25,4,4,4,2,2,4,3,3,1,1,1,1,1,1,1,1,2,2,2,5,1,true,false,false,false,false,false,true,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,2,1,3,2,3,3,19,13,0,5,5,5,4,5,3,3,5,5,1,1,1,1,1,1,1,1,1,1,3,5,7,false,false,true,false,false,false,false,false,false,false,false,false,false,true,false,false,false,false,false,false,true,false,false,false,false,false,false,3,2,3,2,3,3,19,14,5,23,4,4,4,4,3,4,3,3,1,1,1,1,1,1,1,1,1,2,2,5,9,false,false,true,false,false,false,false,false,false,true,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,3,2,3,2,3,3,19,16,7,19,5,4,4,4,3,4,3,3,1,1,1,1,1,1,1,2,2,2,3,5,9,false,false,true,false,false,false,false,false,false,true,false,false,false,false,false,false,false,false,false,false,false,false,true,false,false,false,false,3,2,3,2,3,3,19,18,6,30,4,4,4,4,3,4,4,4,1,1,1,1,1,1,1,1,1,1,1,5,8,false,false,true,true,false,false,false,false,false,false,false,false,false,false,false,false,false,
false,true,false,true,false,false,false,true,false,false,3,2,3,2,3,3,19,20,8,25,4,4,5,4,3,4,3,4,1,1,1,1,1,1,1,1,1,1,3,5,1,false,false,true,false,false,false,true,false,false,false,false,false,false,true,false,false,false,false,false,false,false,false,false,false,true,false,false,3,2,3,2,3,3,19,21,18,2,4,4,4,3,3,4,3,4,1,1,1,1,1,1,1,1,1,1,1,5,1,false,false,true,false,false,false,false,false,false,false,false,false,false,true,false,false,false,false,true,false,true,false,false,false,true,false,false,3,2,3,2,3,"
str1.scan(/^,?(?:[1-5]\d|[1-9])(?:,(?:[1-5]\d|[1-9])){4}(?:,[1-5]){21}(?:,(?:true|false)){27}(?:,[1-5]){5}$/).each{|x|
puts x
puts "---1---"
}
str2.scan(/^,?(?:[1-5]\d|[1-9])(?:,(?:[1-5]\d|[1-9])){4}(?:,[1-5]){21}(?:,(?:true|false)){27}(?:,[1-5]){5}$/).each{|x|
puts x
puts "---2---"
}
Kind of by definition, you can't have more than one pattern match in a string when your pattern specifically says "start of string, then [stuff], then end of string". Look at regexp anchors ^ and $.
A simpler example might make it clearer: ^a$ "start of string, then letter a, then end of string" will match in "a" once, but will match in "aaa" zero times, even though there are three letters a.
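The simpler example can be checked directly. One Ruby-specific detail worth knowing is that ^ and $ anchor to the start and end of a line, while \A and \z anchor to the start and end of the whole string. A minimal sketch:

```ruby
single     = "a".scan(/^a$/)      # the whole line is "a"
triple     = "aaa".scan(/^a$/)    # no line consists of exactly one "a"
unanchored = "aaa".scan(/a/)      # every "a" matches without anchors
per_line   = "a\na".scan(/^a$/)   # ^ and $ match at each line
strict     = "a\na".scan(/\Aa\z/) # \A and \z cover the whole string

p single     # ["a"]
p triple     # []
p unanchored # ["a", "a", "a"]
p per_line   # ["a", "a"]
p strict     # []
```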
$ asserts the position at the end of a line.
In str2 the pattern no longer reaches the end of the line, so the match fails.
^,?(?:[1-5]\d|[1-9])(?:,(?:[1-5]\d|[1-9])){4}(?:,[1-5]){21}(?:,(?:true|false)){27}(?:,[1-5]){5}
Just remove the $ from the end. See demo.
https://regex101.com/r/sJ9gM7/22
Because your regular expression starts with the ^ metacharacter and ends with the $ metacharacter, it expects the full string to match.
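To illustrate how each anchor affects scan, here is a sketch on a much shorter, made-up record format (two numbers and a boolean per record) rather than the full pattern from the question. With ^ kept, scan can still only match at the start of a line; without either anchor it returns every record:

```ruby
# Hypothetical, shortened data: each record is "number,number,boolean".
data = "1,2,true,3,4,false"

both_anchors = data.scan(/^,?\d+,\d+,(?:true|false)$/)
start_only   = data.scan(/^,?\d+,\d+,(?:true|false)/)
no_anchors   = data.scan(/,?\d+,\d+,(?:true|false)/)

p both_anchors # []
p start_only   # ["1,2,true"]
p no_anchors   # ["1,2,true", ",3,4,false"]
```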

How does this regexp split on the first vowel?

This code splits a word into two strings at the first vowel. Why?
word = "banana"
parts = word.split(/([aeiou].*)/)
The key here is the regular expression (or regex) that is being used between the two /'s
[aeiou] says to look for the first instance of one of those characters.
. matches any single character
* modifies the previous thing to mean match 0 or more of it
(...) means capture everything enclosed between the parentheses
Translated to English, this regular expression might read something like "Given a string, find the first vowel that is followed by zero or more characters. Collect that vowel and its following characters and set them aside."
The slightly more confusing part is the regex's interaction with the split method. The text the regex matches is 'anana', yet calling split with the string 'anana' doesn't give the same result:
'banana'.split('anana') #=> ["b"]
But when split is called with a regular expression that uses a capture group - or parentheses (...), then anything in that capture group will also be returned in the result of the split. Which is why:
'banana'.split /([aeiou].*)/ #=> ["b", "anana"]
If you want to learn more about how regular expressions work (particularly in ruby), Rubular is a great resource to fiddle with - http://www.rubular.com/r/XEUgPhOdlH
This is actually a bit tricky. This regexp
/[aeiou].*/
matches the string from the first vowel to the end of the string i.e. "anana". But if you were to split on that, you would only get the first letter since split doesn't include the splitting pattern:
"banana".split /[aeiou].*/
# ["b"]
But according to the String#split docs, if the splitting pattern is a regexp with a capture group, the capture groups are included in the result as well. Since the whole pattern is wrapped in a capture group, the result is that the string splits before the first vowel.
For example, if you change the regexp to have two capture groups, it splits further:
"banana".split /([aeiou])(.*)/
# ["b", "a", "nana"]
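It is the capturing that matters here, not the grouping. A short sketch contrasting a non-capturing group (?:...) with a capturing one:

```ruby
word = "banana"

plain         = word.split(/[aeiou].*/)     # no group
non_capturing = word.split(/(?:[aeiou].*)/) # grouped, but not captured
capturing     = word.split(/([aeiou].*)/)   # captured, so kept in the result

p plain         # ["b"]
p non_capturing # ["b"]
p capturing     # ["b", "anana"]
```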
ANSWER FOR OLD TITLE
It's not really Ruby syntax; it's standard regular expression syntax that Ruby also implements.
* means zero or more of previous item
. means any character
[aeiou] means any one character inside the brackets
() means capture it
So that regex means: capture anything that starts with a, e, i, o, or u.
So word.split(/([aeiou].*)/) means: split the word variable on anything that starts with the letter a, e, i, o, or u.
See here for more information.
ANSWER FOR NEW TITLE
Why does it split on the first vowel? It's not really like that. What it does is split on anything that starts with a vowel and also capture that text (the string that starts with a vowel). See more examples here:
word = 'banana'
word.split /[aeiou]/ # split by vowels
#=> ["b", "n", "n"]
word.split /([aeiou])/ # split by vowels and capture the vowels
#=> ["b", "a", "n", "a", "n", "a"]
word.split /[aeiou].*/ # split by anything that start with vowels
#=> ["b"]
word.split /([aeiou].*)/ # split by anything that start with vowels and capture the thing that start with vowels also
#=> ["b", "anana"]
ANSWER FOR OLD TITLE
If the * symbol is not inside a regular expression // (Ruby's syntax), there are some other possibilities:
multiplication 2 * 3 == 6, 'na' * 3 == 'nanana' # batman!
splat operation [*(1..4)] == [1,2,3,4], see more info here

How can I match Word Boundary "or" [@#]?

I can't seem to get a regex that matches either a hashtag #, an @, or a word boundary. The goal is to break a string into Twitter-like entities and topics, so:
input = "Hello #world, #ruby anotherString"
input.scan(entitiesRegex)
# => ["Hello", "#world", "#ruby", "anotherString"]
To get just the words, excluding "anotherString" which is too large, is simple:
/\b\w{3,12}\b/
will return ["Hello", "world", "ruby"]. Unfortunately this doesn't include the hashtags and @s. It seems like it should work simply with:
/[\b@#]\w{3,12}\b/
but that returns ["#world", "#ruby"]. This made me realize that a word boundary is not a character, so it doesn't fall into the category of "a single character" and won't match inside a character class. A few more attempts:
/\b|[@#]\w{3,12}\b/
returns ["", "", "#world", "", "#ruby", "", "", ""].
/(\b|[@#])\w{3,12}\b/
matches the right things, but returns [[""], ["#"], ["#"]] as expected, because the parentheses also capture everything they enclose.
/((\b|[@#])\w{3,12}\b)/
kind of works. It returns [["Hello", ""], ["#world", "#"], ["#ruby", "#"]]. So now all the correct items are there, they're just located at the first element of each of the subarrays. The following snippet technically works:
input.scan(/((\b|[@#])\w{3,12}\b)/).collect(&:first)
Is it possible to simplify this to match and return the correct substrings with just the regular expression not requiring the collect post-processing?
You can just use the regular expression /[@#]?\b\w+\b/. That is, optionally match a # or @, followed by a word boundary (in #ruby, that boundary is between # and ruby; in a normal word it is at the start of the word) and a run of word characters.
p "Hello #world, #ruby anotherString".scan(/[@#]?\b\w+\b/)
# => ["Hello", "#world", "#ruby", "anotherString"]
Furthermore, you can adjust the number of characters a matching word should have with quantifiers. You gave an example in a comment to a deleted answer to match only #ruby by using {3,4}:
p "Hello #world, #ruby anotherString".scan(/[@#]?\b\w{3,4}\b/)
# => ["#ruby"]
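If you would rather keep the alternation from the question, a non-capturing group (?:...) is one way to avoid the collect step, since scan only returns nested arrays when the pattern has capture groups. A sketch, assuming the intended prefixes are # and @ (the second input is made up to show the @ case):

```ruby
pattern = /(?:\b|[@#])\w{3,12}\b/

input = "Hello #world, #ruby anotherString"
entities = input.scan(pattern)

# Hypothetical second input containing an @ prefix.
mentions = "ping @alice".scan(pattern)

p entities # ["Hello", "#world", "#ruby"]
p mentions # ["ping", "@alice"]
```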

Regex find 'a' or 'an' in sentence in Ruby

I am a beginner in regex. I thought I would complete this without help but couldn't.
I want to find article-word pairs in the following sentences (where the article must be A or An):
This is a sentence. An egg is a word. A gee another word.
Last line is a word. Ocean is very big.
I used this regex pattern:
/[(An)|(an)|a|A]\s+\w+[\s|.]/
Captured pairs are:
'a sentence.', 'n egg ', 'a word.', 'A gee ', 'a word.', 'n is '.
Above pattern couldn't capture An egg fully. However, more strangely it captured 'n is ' in Ocean is.
What could be correct pattern to extract it?
Add a word boundary:
/\b(an?)\s+\w+/i
Edit: (n mustn't be capital)
/\b([aA]n?)\s+\w+/
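As for why the original pattern captured 'n is ': square brackets define a character class, so [(An)|(an)|a|A] matches exactly one character out of (, ), |, A, a and n, rather than one of the alternatives "An", "an", "a" or "A". A small sketch of the difference (the alternation pattern is only one possible rewrite):

```ruby
text = "Ocean is very big."

# Character class: a single "n" satisfies it, so "n is" matches.
char_class = text[/[(An)|(an)|a|A]\s+\w+/]

# Alternation inside a group, anchored with a word boundary: no match here.
alternation = text[/\b(?:An|an|A|a)\s+\w+/]

p char_class  # "n is"
p alternation # nil
```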
s = "This is a sentence. An egg is a word. A gee another word.\nLast line is a word. Ocean is very big."
s.scan /(?<=\A|\s)[Aa]n?\s+[A-Za-z]+/m
# => [
# [0] "a sentence",
# [1] "An egg",
# [2] "a word",
# [3] "A gee",
# [4] "a word"
# ]
Here we go: /(?<=\A|\s)[Aa]n?\s+[A-Za-z]+/m
First comes a lookbehind, so that “an is” in “Ocean is” does not match. Then we look for an “a” (possibly capital), optionally followed by “n”, then whitespace and the word itself. The final m stands for multiline.
To avoid using lookbehind, one may change the regexp to:
/\b[Aa]n?\s+[A-Za-z]+/m
UPD: One should avoid using \w here, since \w matches [A-Za-z0-9_]; note especially the underscore.
Try simplifying to \b(An|an|a|A) \w+\b.
I'd use a very simple pattern, along with scan to find all occurrences:
sentence = <<EOT
This is a sentence. An egg is a word. A gee another word.
Last line is a word. Ocean is very big.
EOT
sentence.scan(/\b an? \s+ [a-z]+/imx)
# => ["a sentence", "An egg", "a word", "A gee", "a word"]
I'm using the x flag to improve the readability of the pattern.
The pattern breaks down to:
\b: a word-boundary so only "a" or "an" match. (It's case insensitive.)
an?: matches "a" or "an".
\s+: matches one or more white-spaces.
[a-z]+: matches consecutive runs of letters only. This is significant because any pattern using the \w character-class would also match 0..9 and "_" (underscore). Your sample doesn't contain those, but any text containing those characters would be likely to give you bad results.
The i flag means ignore case. The m flag lets . match newlines, so the text is treated as a single line. Normally line ends are more significant. x means that whitespace in the pattern is not significant, requiring \s to mark where it should match.
If you want the trailing punctuation or space, add . to the end of the pattern:
sentence.scan(/\b an? \s+ [a-z]+ ./imx)
# => ["a sentence.", "An egg ", "a word.", "A gee ", "a word."]
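If the article and the following word are needed separately, one option is to add capture groups, in which case scan returns one [article, word] array per match. A minimal sketch on the same sample text (the name pairs is only for illustration):

```ruby
sentence = "This is a sentence. An egg is a word. A gee another word.\n" \
           "Last line is a word. Ocean is very big."

# Group 1 is the article, group 2 is the word that follows it.
pairs = sentence.scan(/\b(an?)\s+([a-z]+)/i)

p pairs
# [["a", "sentence"], ["An", "egg"], ["a", "word"], ["A", "gee"], ["a", "word"]]
```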
