Regex match everything up to first period - ruby

Trying to get a lazy regex match of everything up until the first period of a sentence.
e.g. Just want to get "jack and jill." from this sentence:
"jack and jill. went up the hill. to fetch a pail."
/.+\./ matches the whole sentence
/(.+?\.)/ matches each instance
Is there a way to just match the first instance?

/^([^.]+)/
Let's break it down,
^ is the start-of-line anchor
[^.] matches any single character that is not a period
+ repeats that one or more times, so the match runs up to (but does not include) the first period
And the expression is encapsulated with () to capture it.
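A quick check of the breakdown above, as a minimal sketch (the variable names are only illustrative). Because [^.] excludes the period, the capture stops just before it; appending \. to the pattern is one way to include the period if it is wanted.

```ruby
sentence = "jack and jill. went up the hill. to fetch a pail."

# Capture group 1: everything before the first period (period excluded).
without_period = sentence[/^([^.]+)/, 1]

# Same idea with a literal period appended, so the period is part of the match.
with_period = sentence[/^[^.]+\./]

puts without_period  # prints: jack and jill
puts with_period     # prints: jack and jill.
```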

If you only want the first occurrence, do not choose the scan method that returns all results in the string. You can use the match method that returns a MatchData instance, but you can also simply write:
> "jack and jill. went up the hill. to fetch a pail."[/.+?\./]
=> "jack and jill."
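For comparison, here is a small sketch of the three approaches mentioned above (match, String#[] and scan) applied to the same lazy pattern. The variable names are made up for illustration.

```ruby
sentence = "jack and jill. went up the hill. to fetch a pail."
pattern = /.+?\./

# match returns a MatchData (or nil when nothing matches); [0] is the whole match.
md = sentence.match(pattern)
first_via_match = md && md[0]

# String#[] with a regex returns the first matched substring directly.
first_via_slice = sentence[pattern]

# scan returns every non-overlapping match, which is more than is needed here.
all_matches = sentence.scan(pattern)

p first_via_match     # "jack and jill."
p first_via_slice     # "jack and jill."
p all_matches.length  # 3
```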

I would be inclined to use a regex, but there are other options.
str = "jack and jill. went up the hill. supposedly to fetch a pail of water."
i = str.index('.')  # assign first: `str[0..i] if i = ...` raises NameError when i is not yet defined
str[0..i] if i
#=> "jack and jill."
str = "three blind mice"
i = str.index('.')
str[0..i] if i
#=> nil

Related

Regular expression returns only one match

I have a set of keywords. Any keyword can contain a space, e.g. ['one', 'one two']. I generate a regexp from these keywords like this: /\b(?i:one|one\ two|three)\b/. Full example below:
keywords = ['one', 'one two', 'three']
re = /\b(?i:#{ Regexp.union(keywords).source })\b/
text = 'Some word one and one two other word'
text.downcase.scan(re)
the result of this code is
=> ["one", "one"]
How to find match of the second keyword one two and get result like this?
=> ["one", "one two"]
Regexes are eager to match. Once they find a match, they don't try to find another possibly longer one (with one important exception).
/\b(?i:one|one\ two|three)\b/ is never going to match one two because it will always match one first. You'd need /\b(?i:one two|one|three)\b/ so it tries one two first. Probably the simplest way to automate this is to sort by the longest keywords first.
keywords = ['one', 'one two', 'three']
re = Regexp.union(keywords.sort { |a,b| b.length <=> a.length }).source
re = /\b#{re}\b/i;
text = 'Some word one and one two other word'
puts text.scan(re)
Note that I set the whole regex to be case-insensitive, which is easier to read than (?i:...), and that downcasing the string is then redundant.
The exception is repetition like +, * and friends. They are greedy by default. .+ is going to match as many characters as it can. That's greedy. You can make a quantifier lazy, so that it matches as little as possible, by adding a ?. On its own, .+? will match a single character.
"A foot of fools".match(/(.*foo)/); # matches "A foot of foo"
"A foot of fools".match(/(.*?foo)/); # matches "A foo"
The point is that \bone\b matches one in one two and since this branch appears before one two branch, it "wins" (see Remember That The Regex Engine Is Eager).
You need to sort the keyword array in a descending order before building a regex. It will then look like
(?-mix:\b(?i:three|one\ two|one)\b)
This way the longer one two will be before the shorter one and will get matched.
See the Ruby demo:
keywords = ['one', 'one two', 'three']
keywords = keywords.dup.sort.reverse
re = /\b(?i:#{ Regexp.union(keywords).source })\b/
text = 'Some word one and one two other word'
puts text.downcase.scan(re)
# => [ one, one two ]
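The effect of branch order can also be checked side by side. The following is only a sketch: it sorts by descending length (one possible ordering, using sort_by rather than the sort.reverse shown above) and uses illustrative variable names.

```ruby
text = 'Some word one and one two other word'

# With the shorter branch first, "one" wins wherever both branches could match.
short_first = text.scan(/\b(?:one|one two|three)\b/i)

# Sorting by descending length puts "one two" ahead of "one".
keywords = ['one', 'one two', 'three']
by_length = keywords.sort_by { |k| -k.length }
long_first = text.scan(/\b(?:#{Regexp.union(by_length).source})\b/i)

p short_first  # ["one", "one"]
p long_first   # ["one", "one two"]
```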
I tried your example by moving the first element to the second position of the array and it works (e.g. http://rubular.com/r/4F2Hc46wHT).
In fact, it looks like the first keyword "overlaps" the second.
This response may be unhelpful if you can't change keywords order.

Why does the output from my map/regex block not capitalize?

I'm working through the Test First Ruby Master problems. My code for 08/book_titles is this:
class Book
  attr_accessor :title
  def title
    if @title.include?(' ')
      correct = @title.split.each_with_index.map {|x, index| ((x =~ /^a|an|of|the|or|in|and$/) && index != 0) ? x : x.capitalize}
      correct.join(' ')
      # this is throwing a weird error, the code looks right but isn't capitalizing last word (returns 'To Kill a mockingbird')
    else @title.capitalize
    end
  end
end
I tested the map portion separately, and it works fine. But in the entirety of the problem, it does not capitalize as it should be. It throws an rspec error:
1) Book title should capitalize every word except... articles a
Failure/Error: expect(@book.title).to eq("To Kill a Mockingbird")
expected: "To Kill a Mockingbird"
got: "To Kill a mockingbird"
Anyone know why?
I originally didn't include ^/$ in the regex. I got the same error with a different title, and adding those anchors fixed it for that case. But then the error showed up again with the title.
Because mockingbird contains in
('mockingbird' =~ /^a|an|of|the|or|in|and$/) => 4
I think you want this regex:
/^a$|^an$|^of$|^the$|^or$|^in$|^and$/
It is not necessary to break the string into words, modify the words and join them back into a string. In fact, doing that has the disadvantage that spacing between words may be altered. Here's one way of operating on the string directly.
wee_words = ["a", "an", "of", "the", "or", "in", "and"]
str = "a dAy in the life of waltEr mITTY"
str.capitalize.gsub(/\w+/) { |s| wee_words.include?(s) ? s : s.capitalize }
#=> "A Day in the Life of Walter Mitty"
str.capitalize upcases the first letter of the string and downcases all subsequent letters. As a result, the first word will never be treated as a wee_word, since it is capitalized (e.g., wee_words.include?("The") #=> false).
The regex is slightly incorrect. The way to read it as it is can be done this way:
Match any string that
starts with 'a'
or contains 'an'
or contains 'of'
or contains 'the'
or contains 'or'
or contains 'in'
or ends in 'and'
What you really seem to want is something that reads like this:
Match any string that
only contains any of 'a', 'an', 'of', 'the', 'or', 'in', 'and'
To get this, you want your regex to be written like this:
/^(a|an|of|the|or|in|and)$/
Note the parentheses around the alternation. (Alternation is the formal term for multiple choices in a regex, where choices are separated by '|'.)
If you're comparing against book or movie titles, this is much closer to the type of match you'd expect. It will match correctly for titles such as "Chariots of Fire" and "Benny and Joon", but will not falsely match the 'in' inside the "Mockingbird" of "To Kill a Mockingbird", which is a significant improvement.
However, it still won't quite work yet on something like "Benny AND Joon", because 'AND' is uppercase in this title (assuming that incoming titles may be arbitrarily mixed case). One last change will do it:
/^(a|an|of|the|or|in|and)$/i
That last letter 'i' at the end of the regex says to 'ignore case', so that matches can occur regardless of whether the 'AND' is uppercase, lowercase, or mixed case.
This should get you close to what you're trying to achieve and handle a few bumpy use cases in the process.
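To make the difference concrete, here is a rough sketch comparing the original pattern with the grouped, case-insensitive one. The titleize helper is hypothetical (it is not part of the original exercise), and downcasing the small words is an assumption about the desired output.

```ruby
loose  = /^a|an|of|the|or|in|and$/     # anchors bind only to the first and last branch
strict = /^(a|an|of|the|or|in|and)$/i  # anchors apply to the whole alternation

p loose.match?("mockingbird")   # true, because it contains "in"
p strict.match?("mockingbird")  # false
p strict.match?("AND")          # true, thanks to the i flag

# Hypothetical helper: capitalize every word except small words after the first.
def titleize(title)
  small = /^(a|an|of|the|or|in|and)$/i
  title.split.each_with_index.map { |word, i|
    (i > 0 && small.match?(word)) ? word.downcase : word.capitalize
  }.join(' ')
end

puts titleize("to kill a mockingbird")  # prints: To Kill a Mockingbird
```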

Use regular expression to fetch 3 groups from string

This is my expected result.
Input a string and get three strings returned.
I have no idea how to do this with a regex in Ruby.
This is my rough idea:
match(/(.*?)(_)(.*?)(\d+)/)
Input and expected output
# "R224_OO2003" => R224, OO, 2003
# "R2241_OOP2003" => R2241, OOP, 2003
If the example description I gave in my comment on the question is correct, you need a very straightforward regex:
r = /(.+)_(.+)(\d{4})/
Then:
"R224_OO2003".scan(r).flatten #=> ["R224", "OO", "2003"]
"R2241_OOP2003".scan(r).flatten #=> ["R2241", "OOP", "2003"]
Assuming that your three parts consist of (R and one or more digits), then an underbar, then (one or more non-whitespace characters), before finally (a 4-digit numeric date), then your regex could be something like this:
^(R\d+)_(\S+)(\d{4})$
The ^ indicates start of string, and the $ indicates end of string. \d+ indicates one or more digits, while \S+ says one or more non-whitespace characters. The \d{4} says exactly four digits.
To recover data from the matches, you could either use the pre-defined globals that line up with your groups, or you could use named captures.
To use the match globals just use $1, $2, and $3. In general, you can figure out the number to use by counting the left parentheses of the specific group.
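A minimal sketch of the globals approach, using the regex from this answer (the local variable names are illustrative). MatchData#captures is shown as an alternative that avoids the globals.

```ruby
if "R2241_OOP2003" =~ /^(R\d+)_(\S+)(\d{4})$/
  # $1, $2, $3 hold the text of groups 1..3 from the last successful match.
  first, second, third = $1, $2, $3
end

# MatchData#captures returns the same groups as an array, without the globals.
captures = "R224_OO2003".match(/^(R\d+)_(\S+)(\d{4})$/).captures

p [first, second, third]  # ["R2241", "OOP", "2003"]
p captures                # ["R224", "OO", "2003"]
```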
To use the named captures, include ?<name> right after the left paren of a particular group. For example:
x = "R2241_OOP2003"
match_data = /^(?<first>R\d+)_(?<second>\S+)(?<third>\d{4})$/.match(x)
puts match_data['first'], match_data['second'], match_data['third']
yields
R2241
OOP
2003
as expected.
As long as your pattern covers all possibilities, then you just need to use the match object to return the 3 strings:
my_match = "R224_OO2003".match(/(.*?)(_)(.*?)(\d+)/)
#=> #<MatchData "R224_OO2003" 1:"R224" 2:"_" 3:"OO" 4:"2003">
puts my_match[0] #=> "R224_OO2003"
puts my_match[1] #=> "R224"
puts my_match[2] #=> "_"
puts my_match[3] #=> "OO"
puts my_match[4] #=> "2003"
A MatchData object holds each capture group starting at index [1]. As you can see, index [0] returns the entire match. If you don't want to capture the "_", you can leave its parentheses out.
Also, I'm not sure you are getting what you want with the part:
(.*?)
this is a lazy quantifier: it matches zero or more of any character (except a newline), taking as few characters as the rest of the pattern allows.
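To see what the lazy quantifier changes, here is a small sketch comparing (.*?) with the greedy (.*) on the question's input. The underscore is left uncaptured here, and the variable names are illustrative.

```ruby
input = "R224_OO2003"

# Lazy: each group takes as little as possible, so \d+ gets all four digits.
lazy = input.match(/(.*?)_(.*?)(\d+)/).captures

# Greedy: the second group takes as much as it can, leaving \d+ a single digit.
greedy = input.match(/(.*)_(.*)(\d+)/).captures

p lazy    # ["R224", "OO", "2003"]
p greedy  # ["R224", "OO200", "3"]
```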

Why is this Regex result unexpected

The regex in question is
/(<iframe.*?><\/iframe>)/
I am using this ruby regex to match sections of a string then creating an array of the results.
The string is
"<p><iframe src=\"http://www.dailymotion.com/embed/video/k18WBkRTMldXzB7JYW5?logo=0&info=0\" frameborder=\"0\" height=\"450\" width=\"580\"></iframe></p>\n<p>#1<br />\n<iframe src=\"https://www.cloudy.ec/embed.php?id=cabe5d3ba31da\" allowfullscreen=\"\" frameborder=\"0\" height=\"420\" width=\"640\"></iframe></p>\n<p>#2<br />\n<iframe src=\"https://www.cloudy.ec/embed.php?id=b03d31e4b5663\" allowfullscreen=\"\" frameborder=\"0\" height=\"420\" width=\"640\"></iframe></p>\n<p>#3<br />\n<iframe src=\"https://www.cloudy.ec/embed.php?id=f63895add1aac\" allowfullscreen=\"\" frameborder=\"0\" height=\"420\" width=\"640\"></iframe></p>\n"
I am calling the regex with .match() like so
/(<iframe.*?><\/iframe>)/.match(entry.content).to_a
The result is a duplicate of the first match
["<iframe src=\"http://www.dailymotion.com/embed/video/k18WBkRTMldXzB7JYW5?logo=0&info=0\" frameborder=\"0\" height=\"450\" width=\"580\"></iframe>", "<iframe src=\"http://www.dailymotion.com/embed/video/k18WBkRTMldXzB7JYW5?logo=0&info=0\" frameborder=\"0\" height=\"450\" width=\"580\"></iframe>"]
I used Rubular and I was able to get the Regex to work there http://rubular.com/r/CYF0vgQtrX
The result is a duplicate of the first match
Even though the docs for Regexp#match() do a horrible job of describing what match() does, it actually finds only the first match:
str = "abc"
md = /./.match(str)
p md.to_a
--output:--
["a"]
Regexp.match() returns a MatchData object when there is a match. A MatchData object contains matches for the whole match and for each group. If you call to_a() on a MatchData object, the return value is an Array containing the whole match and whatever matched each group in the regex:
str = "abc"
md = /(.)(.)(.)/.match(str)
p md.to_a
--output:--
["abc", "a", "b", "c"]
Because you specified a group in your regex, one result is the whole match, and the other result is what matched your group.
[A regex] was the first approach I thought of. If this wasn't going to
work, then I was going to use nokogiri
From now on, nokogiri should be your first thought...because:
If you have a programming problem, and you think, "I'll use a regex",
now you have two problems.
You should use scan instead of match here.
entry.content.scan(/<iframe.*?><\/iframe>/)
Using /(<iframe.*?><\/iframe>)/ will get a 2d array. The document says:
If the pattern contains groups, each individual result is itself an array containing one entry per group.
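A short sketch of that difference, using a shortened stand-in for the HTML in the question (the html string and variable names are made up for illustration):

```ruby
html = '<p><iframe src="a"></iframe></p><p><iframe src="b"></iframe></p>'

# No capture group: scan returns a flat array of matched strings.
flat = html.scan(/<iframe.*?><\/iframe>/)

# With a capture group: each result is itself an array with one entry per group.
nested = html.scan(/(<iframe.*?><\/iframe>)/)

p flat    # ["<iframe src=\"a\"></iframe>", "<iframe src=\"b\"></iframe>"]
p nested  # [["<iframe src=\"a\"></iframe>"], ["<iframe src=\"b\"></iframe>"]]
```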

What's the difference between scan and match on Ruby string

I am new to Ruby and have always used String.scan to search for the first occurrence of a number. It is kind of strange that the returned value is a nested array, but I just use [0][0] to get the values I want. (I am sure it has its purpose, just that I haven't used it yet.)
I just found out that there is a String.match method. And it seems to be more convenient because the returned array is not nested.
Here is an example of the two, first is scan:
>> 'a 1-night stay'.scan(/(a )?(\d*)[- ]night/i).to_a
=> [["a ", "1"]]
then is match
>> 'a 1-night stay'.match(/(a )?(\d*)[- ]night/i).to_a
=> ["a 1-night", "a ", "1"]
I have checked the API, but I can't really tell the difference, as both are described as 'match the pattern'.
This question is simply out of curiosity about what scan can do that match can't, and vice versa. Is there any specific scenario that only one can accomplish? Is match inferior to scan?
Short answer: scan will return all matches. This doesn't make it superior, because if you only want the first match, str.match[2] reads much nicer than str.scan[0][1].
ruby-1.9.2-p290 :002 > 'a 1-night stay, a 2-night stay'.scan(/(a )?(\d*)[- ]night/i).to_a
=> [["a ", "1"], ["a ", "2"]]
ruby-1.9.2-p290 :004 > 'a 1-night stay, a 2-night stay'.match(/(a )?(\d*)[- ]night/i).to_a
=> ["a 1-night", "a ", "1"]
#scan returns everything that the Regex matches.
#match returns the first match as a MatchData object, which contains data held by special variables like $& (what was matched by the Regex; that's what's mapping to index 0), $1 (match 1), $2, et al.
Previous answers state that scan will return every match from the string the method is called on but this is incorrect.
Scan keeps track of an index and continues looking for subsequent matches after the last character of the previous match.
string = 'xoxoxo'
p string.scan('xo') # => ["xo", "xo", "xo"]
# so far so good but...
p string.scan('xox') # => ["xox"]
# if this returned EVERY instance of 'xox' it would include substrings
# starting at indices 0 and 2, but only one match is found
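If overlapping matches are actually wanted, one common workaround (not part of the answer above, offered only as a sketch) is to wrap the pattern in a zero-width lookahead with a capture group. The lookahead consumes no characters, so scan retries at every position.

```ruby
string = 'xoxoxo'

# Plain scan resumes after the end of the previous match, so matches cannot overlap.
non_overlapping = string.scan('xox')

# The lookahead matches at each position without consuming text; the group captures 'xox'.
overlapping = string.scan(/(?=(xox))/).flatten

p non_overlapping  # ["xox"]
p overlapping      # ["xox", "xox"]
```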

Resources