How can I match Word Boundary "or" [##]? - ruby

I can't seem to get a regex that matches either a hashtag #, an #, or a word-boundary. The goal is to break a string into Twitter-like entities and topics so:
input = "Hello #world, #ruby anotherString"
input.scan(entitiesRegex)
# => ["Hello", "#world", "#ruby", "anotherString"]
To get just the words, excluding "anotherString" which is too long, is simple:
/\b\w{3,12}\b/
will return ["Hello", "world", "ruby"]. Unfortunately this doesn't include the hashtags and #s. It seems like it should work simply with:
/[\b##]\w{3,12}\b/
but that returns ["#world", "#ruby"]. This made me realize that a word boundary is not a character, so it cannot be part of a character class, which matches "a single character"; inside brackets, \b stands for a backspace instead. A few more attempts:
/\b|[##]\w{3,12}\b/
returns ["", "", "#world", "", "#ruby", "", "", ""].
/(\b|[##])\w{3,12}\b/
matches the right things, but returns [[""], ["#"], ["#"]] as expected, because the parentheses also capture what they enclose, and scan then returns only the captured groups.
/((\b|[##])\w{3,12}\b)/
kind of works. It returns [["Hello", ""], ["#world", "#"], ["#ruby", "#"]]. So now all the correct items are there, they're just located at the first element of each of the subarrays. The following snippet technically works:
input.scan(/((\b|[##])\w{3,12}\b)/).collect(&:first)
Is it possible to simplify this to match and return the correct substrings with just the regular expression not requiring the collect post-processing?

You can just use the regular expression /[##]?\b\w+\b/. That is, optionally match a # or #, followed by a word boundary (in #ruby, that boundary would be between # and ruby, in a normal word it would also match at the start of the word) and a bunch of word characters.
p "Hello #world, #ruby anotherString".scan(/[##]?\b\w+\b/)
# => ["Hello", "#world", "#ruby", "anotherString"]
Furthermore, you can adjust the number of characters a matching word should have with quantifiers. You gave an example in a comment to a deleted answer to match only #ruby by using {3,4}:
p "Hello #world, #ruby anotherString".scan(/[##]?\b\w{3,4}\b/)
# => ["#ruby"]
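If the alternation from the question is preferred over the optional prefix, a non-capturing group (?:...) keeps scan returning whole matches rather than arrays of groups. This is only a minimal sketch: it assumes the second sigil is @ (the character class in the question shows # twice), and the names input and entities are illustrative.

```ruby
input = "Hello #world, @ruby anotherString"

# (?:...) groups the alternation without capturing, so String#scan
# returns each whole match instead of an array of captured groups.
# Assumption: the two sigils of interest are @ and #.
entities = input.scan(/(?:\b|[@#])\w{3,12}\b/)

p entities # => ["Hello", "#world", "@ruby"]
```

"anotherString" is skipped because it has 13 word characters, one more than the {3,12} quantifier allows.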

Related

How do I apply gsub subject to a function?

I'm using Rails 5 and Ruby 2.4. I have a function
my_function(str1, str2)
that will return true or false given two string arguments. What I would like to do is given a larger string, for instance
"a b c d"
I would like to replace two consecutive "words" (a word by my definition is a sequence of characters followed by a word boundary) with the empty string if the expression
my_function(str1, str2)
evaluates to true for those two consecutive words. So for instance, if
my_function("b", "c")
evaluates to true, I would like the above string to become
"a d"
How do I do this?
Edit: I'm including the output based on Tom Lord's answer ...
If I use
def stuff(line)
  matches = line.scan(/\b((\S+?)\b.*?\b(\S+?))\b/)
  matches.each do |full_match, word1, word2|
    line.delete!(full_match) if word1.eql?("hello") && word2.eql?("world")
  end
end
and line is
"hello world this is a test"
the resulting string line is
"tisisatst"
This is not quite what I expected. The result should be
" this is a test"
Edit: This is an updated answer, based on the comments below. I have left my original answer at the bottom.
Scanning a string for "two consecutive words" is a bit tricky. Your best option is probably to use the \b anchor in a regex, which signifies a "word boundary":
string_to_change = "a b c d"
matches = string_to_change.scan(/\b((\S+?)\b.*?\b(\S+?))\b/)
# => [["a b", "a", "b"], ["c d", "c", "d"]]
...Where the first string is the "full match" (including any whitespace or punctuation), the others are the two words.
To break down that regex:
\b means "word boundary". I have placed one on each side of both words. This solution assumes that str1 and str2 are both a single word. (If they contain spaces, then I don't know what behaviour you expect?)
\S+? means "one or more non-whitespace characters". (Matching non-greedily, so it will stop matching at the first word boundary.)
You can then remove each "full match" from the string, if the method returns true for the two words:
matches.each do |full_match, word1, word2|
  string_to_change.gsub!(full_match, '') if my_function(word1, word2)
end
One thing that's not accounted for here (you didn't specify this well in your question...) was how to handle strings containing three or more words. For example, consider the following:
"hello world this is a test"
Suppose my_function(word1, word2) returns true only for the pairs ("world", "this") and ("hello", "is").
My code above will only look at the pairs ("hello", "world"), ("this", "is") and ("a", "test"). But perhaps it should actually:
Look at all pairs of words, i.e. match all words with the left- and right- hand side.
Delete pairs of words repeatedly, i.e. after the initial pair: "world this" is removed, the string should be re-scanned and then "hello is" should also be removed?
If such further enhancements are needed, then please explain them clearly in a new question (if you are struggling to solve the problem yourself).
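For the first enhancement (checking every adjacent pair, not only the non-overlapping ones that scan finds), one possible sketch walks the words directly instead of using a regex. The helper name remove_matching_pairs is hypothetical, the block stands in for my_function, and whitespace is collapsed to single spaces, which may or may not be acceptable.

```ruby
# Hypothetical helper: walks the words left to right and drops any adjacent
# pair for which the block returns true. Whitespace is normalized to single
# spaces, unlike the gsub!-based approach.
def remove_matching_pairs(line)
  words = line.split
  kept = []
  i = 0
  while i < words.length
    if i + 1 < words.length && yield(words[i], words[i + 1])
      i += 2 # skip both words of the matching pair
    else
      kept << words[i]
      i += 1
    end
  end
  kept.join(" ")
end

# The block plays the role of my_function(str1, str2).
result = remove_matching_pairs("a b c d") { |w1, w2| w1 == "b" && w2 == "c" }
p result # => "a d"
```

This makes a single pass; it does not re-scan the string after a removal, so the second enhancement would still need extra work.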
Original answer:
str1 = "b"
str2 = "c"
string_to_change = "a b c d"
if my_function(str1, str2)
  string_to_change.gsub!(/\b#{str1}\b\s+\b#{str2}\b/, "")
end
To break down that regex:
\b means "word boundary". I have placed one on each side of both strings. This solution assumes that str1 and str2 are both a single word. (If they contain spaces, then I don't know what behaviour you expect?)
\s+ means "one or more whitespace characters". You may wish to tweak this to allow other punctuation too, such as a comma or full stop. A fully generic solution to this issue could in fact be:
string_to_change.gsub!(/\b#{str1}\b.(\B.)*#{str2}\b/, "")
# Or equivalently:
string_to_change.gsub!(/\b#{str1}\b(.\B)*.#{str2}\b/, "")
.(\B.)* instead collects each character, one at a time, always checking that it is not the first letter of a word (i.e. that it is preceded by a non-word boundary).
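One caveat with interpolating str1 and str2 straight into a regex: any regex metacharacters they contain are interpreted as pattern syntax. A small sketch of the difference, using Regexp.escape and made-up sample values (the strings here are assumptions for illustration only):

```ruby
str1 = "b.c" # illustrative value containing a regex metacharacter
str2 = "d"
text = "a bxc d b.c d e"

# Without escaping, the "." in str1 matches any character, so "bxc d" is removed too.
unescaped = /\b#{str1}\b\s+\b#{str2}\b/
# Regexp.escape makes the interpolated strings match literally.
escaped = /\b#{Regexp.escape(str1)}\b\s+\b#{Regexp.escape(str2)}\b/

without_escape = text.gsub(unescaped, "")
with_escape = text.gsub(escaped, "")

p without_escape
p with_escape
```

With the escaped pattern only the literal "b.c d" is removed, while "bxc d" stays in the string.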

Regular expression returns only one match

I have a set of keywords. Any keyword can contain a space, e.g. ['one', 'one two']. I generate a regexp from these keywords like this: /\b(?i:one|one\ two|three)\b/. Full example below:
keywords = ['one', 'one two', 'three']
re = /\b(?i:#{ Regexp.union(keywords).source })\b/
text = 'Some word one and one two other word'
text.downcase.scan(re)
the result of this code is
=> ["one", "one"]
How can I match the second keyword, one two, and get a result like this?
=> ["one", "one two"]
Regexes are eager to match. Once they find a match, they don't try to find another possibly longer one (with one important exception).
/\b(?i:one|one\ two|three)\b/ is never going to match one two because it will always match one first. You'd need /\b(?i:one two|one|three)\b/ so it tries one two first. Probably the simplest way to automate this is to sort by the longest keywords first.
keywords = ['one', 'one two', 'three']
re = Regexp.union(keywords.sort { |a,b| b.length <=> a.length }).source
# Group the alternation so that \b applies to every keyword, not only the first and last.
re = /\b(?:#{re})\b/i
text = 'Some word one and one two other word'
puts text.scan(re)
Note that I set the whole regex to be case-insensitive, which is easier to read than (?i:...), and that downcasing the string is redundant.
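The effect of alternative order can be seen in isolation with a tiny sketch (the variable names are illustrative):

```ruby
text = "one two"

# The first alternative that matches wins, even though a later one is longer.
short_first = text[/\b(?:one|one two)\b/]

# Listing the longer keyword first lets it match.
long_first = text[/\b(?:one two|one)\b/]

p short_first # => "one"
p long_first  # => "one two"
```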
The exception is repetition like +, * and friends. They are greedy by default. .+ is going to match as many characters as it can. That's greedy. You can make it lazy, to match the first thing it sees, with a ?. .+? will match a single character.
"A foot of fools".match(/(.*foo)/); # matches "A foot of foo"
"A foot of fools".match(/(.*?foo)/); # matches "A foo"
The point is that \bone\b matches one in one two and since this branch appears before one two branch, it "wins" (see Remember That The Regex Engine Is Eager).
You need to sort the keyword array in a descending order before building a regex. It will then look like
(?-mix:\b(?i:three|one\ two|one)\b)
This way the longer one two will be before the shorter one and will get matched.
See the Ruby demo:
keywords = ['one', 'one two', 'three']
keywords = keywords.dup.sort.reverse
re = /\b(?i:#{ Regexp.union(keywords).source })\b/
text = 'Some word one and one two other word'
puts text.downcase.scan(re)
# => [ one, one two ]
I tried your example by moving the first element to the second position of the array and it works (e.g. http://rubular.com/r/4F2Hc46wHT).
In fact, it looks like the first keyword "overlaps" the second.
This response may be unhelpful if you can't change keywords order.

Use regular expression to fetch 3 groups from string

This is my expected result: input a string and get three strings back.
I have no idea how to do this with a regex in Ruby.
This is my rough idea:
match(/(.*?)(_)(.*?)(\d+)/)
Input and expected output
# "R224_OO2003" => R224, OO, 2003
# "R2241_OOP2003" => R2241, OOP, 2003
If the example description I gave in my comment on the question is correct, you need a very straightforward regex:
r = /(.+)_(.+)(\d{4})/
Then:
"R224_OO2003".scan(r).flatten #=> ["R224", "OO", "2003"]
"R2241_OOP2003".scan(r).flatten #=> ["R2241", "OOP", "2003"]
Assuming that your three parts consist of (R and one or more digits), then an underbar, then (one or more non-whitespace characters), before finally (a 4-digit numeric date), then your regex could be something like this:
^(R\d+)_(\S+)(\d{4})$
The ^ indicates start of string, and the $ indicates end of string. \d+ indicates one or more digits, while \S+ says one or more non-whitespace characters. The \d{4} says exactly four digits.
To recover data from the matches, you could either use the pre-defined globals that line up with your groups, or you could use named captures.
To use the match globals just use $1, $2, and $3. In general, you can figure out the number to use by counting the left parentheses of the specific group.
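As a short sketch of the globals approach, assuming the same pattern as above (the variable name parts is illustrative):

```ruby
if "R2241_OOP2003" =~ /^(R\d+)_(\S+)(\d{4})$/
  # $1, $2 and $3 hold the text captured by the first, second and third group.
  parts = [$1, $2, $3]
end

p parts # => ["R2241", "OOP", "2003"]
```

The globals are only meaningful right after a successful match, which is why they are read inside the if.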
To use named captures, include ?<name> right after the left paren of a particular group. For example:
x = "R2241_OOP2003"
match_data = /^(?<first>R\d+)_(?<second>\S+)(?<third>\d{4})$/.match(x)
puts match_data['first'], match_data['second'], match_data['third']
yields
R2241
OOP
2003
as expected.
As long as your pattern covers all possibilities, then you just need to use the match object to return the 3 strings:
my_match = "R224_OO2003".match(/(.*?)(_)(.*?)(\d+)/)
#=> #<MatchData "R224_OO2003" 1:"R224" 2:"_" 3:"OO" 4:"2003">
puts my_match[0] #=> "R224_OO2003"
puts my_match[1] #=> "R224"
puts my_match[2] #=> "_"
puts my_match[3] #=> "OO"
puts my_match[4] #=> "2003"
A MatchData object exposes each match group starting at index [1]. As you can see, index [0] returns the entire match. If you don't want to capture the "_" you can leave its parentheses out.
Also, make sure this part does what you want:
(.*?)
This matches zero or more of any character, lazily: the ? after * makes the quantifier match as few characters as possible, it does not mean "zero or one".
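Leaving out the parentheses around "_", as suggested above, gives exactly three groups, and MatchData#captures returns them as an array. A minimal sketch (the names m, first, middle and year are illustrative):

```ruby
m = "R224_OO2003".match(/(.*?)_(.*?)(\d+)/)

# captures returns only the groups, without the full match at index 0.
first, middle, year = m.captures

p m.captures # => ["R224", "OO", "2003"]
```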

How does this regexp split on the first vowel?

This code splits a word into two strings at the first vowel. Why?
word = "banana"
parts = word.split(/([aeiou].*)/)
The key here is the regular expression (or regex) that is being used between the two /'s
[aeiou] says to look for the first instance of one of those characters.
. matches any single character
* modifies the previous thing to mean match 0 or more of it
(...) means capture everything enclosed between the parentheses
Translated to English this regular expression might read something like "Given a string, find the first vowel that is followed by zero or more characters. Collect that vowel and its following characters and set them aside."
The slightly more confusing part is the regex's interaction with the split method. The text the regex matches is 'anana'. And we can see that calling split with 'anana' doesn't have the same result:
'banana'.split('anana') #=> ["b"]
But when split is called with a regular expression that uses a capture group - or parentheses (...), then anything in that capture group will also be returned in the result of the split. Which is why:
'banana'.split /([aeiou].*)/ #=> ["b", "anana"]
If you want to learn more about how regular expressions work (particularly in ruby), Rubular is a great resource to fiddle with - http://www.rubular.com/r/XEUgPhOdlH
This is actually a bit tricky. This regexp
/[aeiou].*/
matches the string from the first vowel to the end of the string i.e. "anana". But if you were to split on that, you would only get the first letter since split doesn't include the splitting pattern:
"banana".split /[aeiou].*/
# ["b"]
But according to the String#split docs, if the splitting pattern is a regexp with a capture group, the capture groups are included in the result as well. Since the whole pattern is wrapped in a capture group, the result is that the string splits before the first vowel.
For example, if you change the regexp to have two capture groups, it splits further:
"banana".split /([aeiou])(.*)/
# ["b", "a", "nana"]
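The same rule can be seen on a different string. In this sketch (sample string and variable names are made up), only the capturing group puts the separators into the result; a non-capturing group behaves like no group at all:

```ruby
expr = "2+3-1"

plain         = expr.split(/[+-]/)     # separators are discarded
capturing     = expr.split(/([+-])/)   # captured separators are kept
non_capturing = expr.split(/(?:[+-])/) # grouped but not captured: discarded

p plain         # => ["2", "3", "1"]
p capturing     # => ["2", "+", "3", "-", "1"]
p non_capturing # => ["2", "3", "1"]
```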
ANSWER FOR OLD TITLE
It's not really Ruby syntax; it's standard regular expression syntax that Ruby also implements.
* means zero or more of previous item
. means any character
[aeiou] means any character inside the brackets
() means capture it
So that regex means: capture anything that starts with a, e, i, o, or u.
word.split(/([aeiou].*)/) means: split the word variable on anything that starts with the letter a, e, i, o, or u.
See here for more information.
ANSWER FOR NEW TITLE
Why does it split on the first vowel? It's not really like that. What it does is split on anything that starts with a vowel and also capture it (the string that starts with a vowel). See more examples here:
word = 'banana'
word.split /[aeiou]/ # split on vowels
#=> ["b", "n", "n"]
word.split /([aeiou])/ # split on vowels and capture the vowels
#=> ["b", "a", "n", "a", "n", "a"]
word.split /[aeiou].*/ # split on anything that starts with a vowel
#=> ["b"]
word.split /([aeiou].*)/ # split on anything that starts with a vowel and also capture it
#=> ["b", "anana"]
ANSWER FOR OLD TITLE
If the * symbol is not inside a regular expression // (Ruby's syntax), there are some possibilities:
multiplication 2 * 3 == 6, 'na' * 3 == 'nanana' # batman!
splat operation [*(1..4)] == [1,2,3,4], see more info here

Regex replace pattern with first char of match & second char in caps

Let's say I have the following string:
"a test-eh'l"
I want to capitalize the start of each word. A word can be separated by a space, apostrophe, hyphen, a forward slash, a period, etc. So I want the string to turn out like this:
"A Test-Eh'L"
I'm not too worried about getting the first character capitalized from the gsub call, as that's easy to do after the fact. However, when I've been using IRB and the match method, I only seem to be getting one result. When I use scan, it collects the matches, but the problem is I cannot really do much with it, as I need to replace the contents of the original string.
Here's what I have so far:
"a test-eh'a".scan(/[\s|\-|\'][a-z]/)
=> [" t", "-e", "'a"]
"a test-eh'a".match(/[\s|\-|\'][a-z]/)
=> #<MatchData " t">
Then if I try the pattern using gsub:
"a test-eh'a".gsub(/[\s|\-|\'][a-z]/, $1)
TypeError: can't convert nil into String
In JavaScript, I would normally use parentheses instead of square brackets on the front section. However, I wasn't getting correct results in the scan call when doing so.
"a test-eh'a".scan(/(\s|\-|\')[a-z]/)
=> [[" "], ["-"], ["'"]]
"a test-eh'a".gsub(/(\s|\-|\')[a-z]/, $1)
=> "a'est'h'"
Any help would be appreciated.
Try this:
"a test-eh'a".gsub(/(?:^|\s|-|')[a-z]/) { |r| r.upcase }
# => "A Test-Eh'A"
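The attempts with $1 as the second argument fail because Ruby evaluates $1 before gsub runs, so it holds nil or whatever a previous match left behind. Inside a block, $1 and $2 refer to the current match. A sketch along those lines, where the helper name capitalize_words and the exact separator set (whitespace, hyphen, apostrophe, slash, period) are assumptions based on the question:

```ruby
# Hypothetical helper: upcases a lowercase letter at the start of the string
# (or line) or directly after one of the assumed separator characters.
def capitalize_words(str)
  # $1 is the separator (empty at the start), $2 is the letter to upcase.
  str.gsub(/(^|[\s\-'\/.])([a-z])/) { "#{$1}#{$2.upcase}" }
end

p capitalize_words("a test-eh'l") # => "A Test-Eh'L"
```

Keeping the separator in its own group means the block could also transform it, although here it is passed through unchanged.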
