How does this regexp split on the first vowel?

How does this regexp split on the first vowel? - ruby

This code splits a word into two strings at the first vowel. Why?
word = "banana"
parts = word.split(/([aeiou].*)/)

The key here is the regular expression (or regex) that is being used between the two /'s
[aeiou] says to look for the first instance of one of those characters.
. matches any single character
* modifies the previous thing to mean match 0 or more of it
(...) means capture everything enclosed between the parentheses
Translated to english this regular expression might read something like "Given a string, find the first vowel that is followed by zero or more characters. Collect that vowel and its following characters and set them aside."
The slightly more confusing part is the regex's interaction with the split method. The value the regex returns is 'anana'. And we can see that calling split with 'anana' doesn't have the same result:
'banana'.split('anana') #=> ["b"]
But when split is called with a regular expression that uses a capture group - or parentheses (...), then anything in that capture group will also be returned in the result of the split. Which is why:
'banana'.split /([aeiou].*)/ #=> ["b", "anana"]
If you want to learn more about how regular expressions work (particularly in ruby), Rubular is a great resource to fiddle with - http://www.rubular.com/r/XEUgPhOdlH

This is actually a bit tricky. This regexp
/[aeiou].*/
matches the string from the first vowel to the end of the string i.e. "anana". But if you were to split on that, you would only get the first letter since split doesn't include the splitting pattern:
"banana".split /[aeiou].*/
# ["b"]
But according to the String#split docs, if the splitting pattern is a regexp with a capture group, the capture groups are included in the result as well. Since the whole pattern is wrapped in a capture group, the result is that the string splits before the first vowel.
For example, if you change the regexp to have two capture groups, it splits further:
"banana".split /([aeiou])(.*)/
# ["b", "a", "nana"]

ANSWER FOR OLD TITLE
It's not really a Ruby's syntax, it's a standard Regular Expression's syntax that also implemented by Ruby.
* means zero or more of previous item
. means any character
[aeiou] means any character inside the brace
() means capture it
So that regex means: capture anything that starts with a, e, i, o, or u.
the word.split(/([aeiou].*)/) means, split the word variable based on anything that starts with letter a, e, i, o, or u.
See here fore more information.
ANSWER FOR NEW TITLE
Why does it split on the first vowel? It's not really like that.. What it does is, split by anything that start with vowels and capture it (the string that starts with vowels) also, see more example here:
word = 'banana'
word.split /[aeiou]/ # split by vowels
#=> ["b", "n", "n"]
word.split /([aeiou])/ # split by vowels and capture the vowels
#=> ["b", "a", "n", "a", "n", "a"]
word.split /[aeiou].*/ # split by anything that start with vowels
#=> ["b"]
word.split /([aeiou].*)/ # split by anything that start with vowels and capture the thing that start with vowels also
#=> ["b", "anana"]
ANSWER FOR OLD TITLE
If the * symbol not inside the regular expression // (Ruby's syntax), there are some possibilities:
multiplication 2 * 3 == 6, 'na' * 3 == 'nanana' # batman!
splat operation [*(1..4)] == [1,2,3,4], see more info here

Related

How to split a string in half, into two variables, in one statement?

I want to split str in half and assign each half to first and second
Like this pseudo code example:
first,second = str.split( middle )

class String
def halves
chars.each_slice(size / 2).map(&:join)
end
end
Will work, but you will need to adjust to how you want to handle odd-sized strings.
Or in-line:
first, second = str.chars.each_slice(str.length / 2).map(&:join)

first,second = str.partition(/.{#{str.size/2}}/)[1,2]
Explanation
You can use partition. Using a regex pattern to look for X amount of characters (in this case str.size / 2).
Partition returns three elements; head, match, and tail. Because we are matching on any character, the head will always be a blank string. So we only care about the match and tail hence [1,2]

Here are two ways to do that
rgx = /
(?<= # begin a positive lookbehind
\A # match the beginning of the string
.{#{str.size/2}} # match any character #{str.size/2} times
) # end positive lookbehind
/x # invoke free-spacing regex definition mode
def halves(str)
str.split(rgx)
end
first, second = halves('abcdef')
#=> ["abc", "def"]
first, second = halves('abcde')
#=> ["ab", "cde"]
The regular expression is conventionally written
/(?<=\A.{#{str.size/2}})/
Note that the regular expression matches a location between two successive characters.
def halves(str)
[str[0, str.size/2], str[str.size/2..-1]]
end
first, second = halves('abcdef')
#=> ["abc", "def"]
first, second = halves('abcde')
#=> ["ab", "cde"]

Note: This only works with even length strings.
Along the line of your pseudocode,
first, second = string[0...string.length/2], string[string.length/2...string.length]
If string is the original string.

How do I apply gsub subject to a function?

I"m using Rails 5 and Ruby 2.4. I have a function
my_function(str1, str2)
that will return true or false given two string arguments. What I would like to do is given a larger string, for instance
"a b c d"
I would like to replace two consecutive "words" (a word by my definition is a sequence of characters followed by a word boundary) with the empty string if the expression
my_function(str1, str2)
evaluates to true for those two consecutive words. So for instance, if
my_function("b", "c")
evaluates to true, I would like the above string to become
"a d"
How do I do this?
Edit: I'm including the output based on Tom Lord's answer ...
If I use
def stuff(line)
matches = line.scan(/\b((\S+?)\b.*?\b(\S+?))\b/)
matches.each do |full_match, word1, word2|
line.delete!(full_match) if word1.eql?("hello") && word2.eql?("world")
end
end
and line is
"hello world this is a test"
the resulting string line is
"tisisatst"
THis is not quite what I expected. THe result should be
" this is a test"

Edit: This is an updated answer, based on the comments below. I have left my original answer at the bottom.
Scanning a string for "two consecutive words" is a bit tricky. Your best option is probably to use the \b anchor in a regex, which signifies a "word boundary":
string_to_change = "a b c d"
matches = string_to_change.scan(/\b((\S+?)\b.*?\b(\S+?))\b/)
# => [["a b", "a", "b"], ["c d", "c", "d"]]
...Where the first string is the "full match" (including any whitespace or punctuation), the others are the two words.
To break down that regex:
\b means "word boundary". I have placed one of each side of both strings. This solution assumes that str1 and str2 are both a single word. (If they contain spaces, then I don't know what behaviour you expect?)
\S+? means "one or more non-whitespace character". (Matching non-greedily, so it will stop matching at the first word boundary).
You can then remove each "full match" from the string, if the method returns true for the two words:
matches.each do |full_match, word1, word2|
string_to_change.gsub!(full_match, '') if my_function(word1, word2)
end
One thing that's not accounted for here (you didn't specify this well in your question...) was how to handle strings containing three or more words. For example, consider the following:
"hello world this is a test"
Suppose my_function(word1, word2) returns true only for the pairs: "world", "this" and "hello", "is".
My code above will only look at the pairs: "hello", "world", "this", "is" and "a", "test". But perhaps it should actually:
Look at all pairs of words, i.e. match all words with the left- and right- hand side.
Delete pairs of words repeatedly, i.e. after the initial pair: "world this" is removed, the string should be re-scanned and then "hello is" should also be removed?
If such further enhancements are needed, then please explain them clearly in a new question (if you are struggling to solve the problem yourself).
Original answer:
str1 = "b"
str2 = "c"
string_to_change = "a b c d"
if my_function(str1, str2)
string_to_change.gsub!(/\b#{str1}\b\s+\b#{str2}\b/, "")
end
To break down that regex:
\b means "word boundary". I have placed one of each side of both strings. This solution assumes that str1 and str2 are both a single word. (If they contain spaces, then I don't know what behaviour you expect?)
\s+ means "one or more whitespace character". You may wish to tweak this to allow other punctuation too, such as a comma or full stop. A fully generic solution to this issue could in fact be:
.
string_to_change.gsub!(/\b#{str1}\b.(\B.)*#{str2}\b/, "")
# Or equivalently:
string_to_change.gsub!(/\b#{str1}\b(.\B)*.#{str2}\b/, "")
.(\B.)* is instead collecting each character, one at a time, always checking that it's not the first letter of a word (i.e. is proceeded by a non-word boundary).

Split by multiple delimiters

I'm receiving a string that contains two numbers in a handful of different formats:
"344, 345", "334,433", "345x532" and "432 345"
I need to split them into two separate numbers in an array using split, and then convert them using Integer(num).
What I've tried so far:
nums.split(/[\s+,x]/) # split on one or more spaces, a comma or x
However, it doesn't seem to match multiple spaces when testing. Also, it doesn't allow a space in the comma version shown above ("344, 345").
How can I match multiple delimiters?

You are using a character class in your pattern, and it matches only one character. [\s+,x] matches 1 whitespace, or a +, , or x. You meant to use (?:\s+|x).
However, perhaps, a mere \D+ (1 or more non-digit characters) should suffice:
"345, 456".split(/\D+/).map(&:to_i)

R1 = Regexp.union([", ", ",", "x", " "])
#=> /,\ |,|x|\ /
R2 = /\A\d+#{R1}\d+\z/
#=> /\A\d+(?-mix:,\ |,|x|\ )\d+\z/
def split_it(s)
return nil unless s =~ R2
s.split(R1).map(&:to_i)
end
split_it("344, 345") #=> [344, 345]
split_it("334,433") #=> [334, 433]
split_it("345x532") #=> [345, 532]
split_it("432 345") #=> [432, 345]
split_it("432&345") #=> nil
split_it("x32 345") #=> nil

Your original regex would work with a minor adjustment to move the '+' symbol outside the character class:
"344 ,x 345".split(/[\s,x]+/).map(&:to_i) #==> [344,345]
If the examples are actually the only formats that you'll encounter, this will work well. However, if you have to be more flexible and accommodate unknown separators between the numbers, you're better off with the answer given by Wiktor:
"344 ,x 345".split(/\D+/).map(&:to_i) #==> [344,345]
Both cases will return an array of Integers from the inputs given, however the second example is both more robust and easier to understand at a glance.

it doesn't seem to match multiple spaces when testing
Yeah, character class (square brackets) doesn't work like this. You apply quantifiers on the class itself, not on its characters. You could use | operator instead. Something like this:
.split(%r[\s+|,\s*|x])

Use regular expression to fetch 3 groups from string

This is my expected result.
Input a string and get three returned string.
I have no idea how to finish it with Regex in Ruby.
this is my roughly idea.
match(/(.*?)(_)(.*?)(\d+)/)
Input and expected output
# "R224_OO2003" => R224, OO, 2003
# "R2241_OOP2003" => R2244, OOP, 2003

If the example description I gave in my comment on the question is correct, you need a very straightforward regex:
r = /(.+)_(.+)(\d{4})/
Then:
"R224_OO2003".scan(r).flatten #=> ["R224", "OO", "2003"]
"R2241_OOP2003".scan(r).flatten #=> ["R2241", "OOP", "2003"]

Assuming that your three parts consist of (R and one or more digits), then an underbar, then (one or more non-whitespace characters), before finally (a 4-digit numeric date), then your regex could be something like this:
^(R\d+)_(\S+)(\d{4})$
The ^ indicates start of string, and the $ indicates end of string. \d+ indicates one or more digits, while \S+ says one or more non-whitespace characters. The \d{4} says exactly four digits.
To recover data from the matches, you could either use the pre-defined globals that line up with your groups, or you could could use named captures.
To use the match globals just use $1, $2, and $3. In general, you can figure out the number to use by counting the left parentheses of the specific group.
To use the named captures, include ? right after the left paren of a particular group. For example:
x = "R2241_OOP2003"
match_data = /^(?<first>R\d+)_(?<second>\S+)(?<third>\d{4})$/.match(x)
puts match_data['first'], match_data['second'], match_data['third']
yields
R2241
OOP
2003
as expected.

As long as your pattern covers all possibilities, then you just need to use the match object to return the 3 strings:
my_match = "R224_OO2003".match(/(.*?)(_)(.*?)(\d+)/)
#=> #<MatchData "R224_OO2003" 1:"R224" 2:"_" 3:"OO" 4:"2003">
puts my_match[0] #=> "R224_OO2003"
puts my_match[1] #=> "R224"
puts my_match[2] #=> "_"
puts my_match[3] #=> "00"
puts my_match[4] #=> "2003"
A MatchData object contains an array of each match group starting at index [1]. As you can see, index [0] returns the entire string. If you don't want the capture the "_" you can leave it's parentheses out.
Also, I'm not sure you are getting what you want with the part:
(.*?)
this basically says one or more of any single character followed by zero or one of any single character.

How can I match Word Boundary "or" [##]?

I can't seem to get a regex that matches either a hashtag #, an #, or a word-boundary. The goal is to break a string into Twitter-like entities and topics so:
input = "Hello #world, #ruby anotherString"
input.scan(entitiesRegex)
# => ["Hello", "#world", "#ruby", "anotherString"]
To get just the words, excluding "anotherString" which is too large, is simple:
/\b\w{3,12}\b/
will return ["Hello", "world", "ruby"]. Unfortunately this doesn't include the hashtags and #s. It seems like it should work simply with:
/[\b##]\w{3,12}\b/
but that returns ["#world", "#ruby"]. This made me realize that word boundaries are not by definition a character, so they don't fall into the category of "A single character" and, so, won't match. A few more attempts:
/\b|[##]\w{3,12}\b/
returns ["", "", "#world", "", "#ruby", "", "", ""].
/((\b|[##])\w{3,12}\b)/
matches the right things, but returns [[""], ["#"], ["#"], [""]] as expected, because the braces also mean capture everything enclosed.
/((\b|[##])\w{3,12}\b)/
kind of works. It returns [["Hello", ""], ["#world", "#"], ["#ruby", "#"]]. So now all the correct items are there, they're just located at the first element of each of the subarrays. The following snippet technically works:
input.scan(/((\b|[##])\w{3,12}\b)/).collect(&:first)
Is it possible to simplify this to match and return the correct substrings with just the regular expression not requiring the collect post-processing?

You can just use the regular expression /[##]?\b\w+\b/. That is, optionally match a # or #, followed by a word boundary (in #ruby, that boundary would be between # and ruby, in a normal word it would also match at the start of the word) and a bunch of word characters.
p "Hello #world, #ruby anotherString".scan(/[##]?\b\w+\b/)
# => ["Hello", "#world", "#ruby", "anotherString"]
Furthermore, you can adjust the number of characters a matching word should have with quantifiers. You gave an example in a comment to a deleted answer to match only #ruby by using {3,4}:
p "Hello #world, #ruby anotherString".scan(/[##]?\b\w{3,4}\b/)
# => ["#ruby"]

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

How does this regexp split on the first vowel? - ruby

This code splits a word into two strings at the first vowel. Why? word = "banana" parts = word.split(/([aeiou].*)/)

Related

How to split a string in half, into two variables, in one statement?

How do I apply gsub subject to a function?

Split by multiple delimiters

Use regular expression to fetch 3 groups from string

How can I match Word Boundary "or" [##]?

Categories

Resources