Ruby Regex: negative lookahead with unlimited matching before - ruby

I'm trying to be able to match a phrase like:
I request a single car
// or
I request a single person
// or
I request a single coconut tree
but not
I request a single car by id
// nor
I request a single person by id with friends
// nor
I request a single coconut tree by id with coconuts
Something like this works:
/^I request a single person(?!\s+by id.*)/
for strings like this:
I request a single person
I request a single person with friends
But when I replace the person with a matcher (.*) or add the $ to the end, it stops working:
/^I request a single (.*)(?!\s+by id.*)$/
How can I accomplish this but still match in the first match everything before the negative lookahead?

There's no ) to match ( in (.*\). Perhaps that's a typo, since you tested. After fixing that, however, there's still a problem:
"I request a single car by idea" =~ /^I request a single (?!.*by id.*)(.*)$/
#=> nil
Presumably, that should be a match. If you only want to know if there's a match, you can use:
r = /^I request a single (?!.+?by id\b)/
Then:
"I request a single car by idea" =~ r #=> 0
"I request a single person by id with friends" =~ r #=> nil
\b matches a word break, which includes the case where the previous character is the last one in the string. Notice that if you are just checking for a match, there's no need to include anything beyond the negative lookahead.
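As a rough check, the check-only pattern can be run against the phrases from the question. This is only a sketch: the CHECK constant and the accepted? helper are names made up for illustration, and the last phrase is the "by idea" case discussed above.

```ruby
# Check-only pattern from the answer: nothing is needed after the lookahead.
CHECK = /^I request a single (?!.+?by id\b)/

# Hypothetical helper: true when the phrase is accepted by the pattern.
def accepted?(phrase)
  phrase.match?(CHECK)
end

PHRASES = [
  "I request a single car",
  "I request a single person",
  "I request a single coconut tree",
  "I request a single car by id",
  "I request a single person by id with friends",
  "I request a single coconut tree by id with coconuts",
  "I request a single car by idea"
].freeze

# Print each phrase with whether the pattern accepts it.
PHRASES.each { |phrase| puts "#{accepted?(phrase)}  #{phrase}" }
```

The first three phrases and the "by idea" phrase are accepted; the three "by id" phrases are rejected.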
If you want to return whatever follows "single " when there's a match, use:
r = /^I request a single (?!.+?by id\b)(.*)/
"I request a single coconut tree"[r,1] #=> "coconut tree"
"I request a single person by id with friends"[r,1] #=> nil

OK, I think I just got it, right after asking the question. Instead of creating a lookahead after the thing I want to capture, I create a lookahead before the thing I want to capture, like so:
/^I request a single (?!.*by id.*)(.*[^\s])?\s*$/
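A small sketch of why the position of the lookahead matters. With the lookahead after a greedy (.*), the group swallows the whole rest of the string, so the lookahead is only tested at the very end, where there is nothing left to reject. The constant names TRAILING and LEADING are made up for this illustration.

```ruby
# Lookahead after the capture: (.*) consumes "car by id" first,
# and the lookahead then sees only the end of the string.
TRAILING = /^I request a single (.*)(?!\s+by id.*)$/

# Lookahead before the capture: the rest of the string is inspected
# before anything is consumed.
LEADING = /^I request a single (?!.*by id.*)(.*[^\s])?\s*$/

WITH_ID = "I request a single car by id"
WITHOUT_ID = "I request a single coconut tree  "

p WITH_ID[TRAILING, 1]    # the unwanted capture: "car by id"
p WITH_ID[LEADING, 1]     # nil, the phrase is rejected
p WITHOUT_ID[LEADING, 1]  # "coconut tree", trailing spaces are left out
```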

Related

Match String with Number Placeholder

I would like to match the following strings with String#match (https://apidock.com/ruby/String/match):
"The account 340394034 is finalized"
"The account 9394834 is finalized"
"The account 12392039483 is finalized"
"The account 3493849384 is finalized"
"The account 32984938434983 is finalized"
Which regex do I have to use to match these strings with a number placeholder in them? Thanks
"The account {number_placeholder} is finalized"
This is the full regex
\d+
Depending on input, assuming there is a possibility of other numbers in the string, you could use this instead and get the contents of capture group 1:
account\s+(\d+)
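To make the difference concrete, here is a minimal sketch using String#[] with a capture-group index. The LOG_LINE string is a made-up input that contains a second number, which is the situation the longer pattern is meant for.

```ruby
# Hypothetical input with another number before the account number.
LOG_LINE = "Ticket 42: The account 9394834 is finalized"

# \d+ alone returns the first run of digits, whatever it belongs to.
ANY_NUMBER = LOG_LINE[/\d+/]

# Anchoring on the word "account" and asking for group 1 returns the intended digits.
ACCOUNT_NUMBER = LOG_LINE[/account\s+(\d+)/, 1]

p ANY_NUMBER      # "42"
p ACCOUNT_NUMBER  # "9394834"
```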
If you just want to use the match method to determine whether a given string matches the pattern in your examples, you can do this:
example = "The account 32984938434983 is finalized"
if example.match(/The account \d+ is finalized/)
puts "it matched"
else
puts "it didn't match"
end
The match method returns a MatchData object (which describes the part of the string that matched the regex, in this case the whole string). Calling it on a non-matching string returns nil, which means you can use the result of match directly as the condition of an if statement.
If you want to extract the number in the string, only if the string matches the pattern, you could do this:
example = "The account 32984938434983 is finalized"
match_result = example.match(/The account (\d+) is finalized/)
number = if match_result
  match_result.captures.first.to_i
else
  nil # or 0 / some other default value
end
The parentheses in the regex form a "capture group". The captures method on the result gives an array of all the capture group matches. The first method gets the first (and in this case only) element from that array, and the to_i method converts the string into an integer.
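The same extraction can be wrapped in a small method. This is just one possible sketch: the method name account_number and the named group number are illustrative choices, and the \A and \z anchors are an added assumption that the whole string must have this shape.

```ruby
# Hypothetical helper: returns the account number as an Integer,
# or nil when the text does not have the expected shape.
def account_number(text)
  match = text.match(/\AThe account (?<number>\d+) is finalized\z/)
  match ? match[:number].to_i : nil
end

p account_number("The account 340394034 is finalized")
p account_number("The account is not finalized")
```

A named group reads a little more clearly than captures.first once a pattern grows more than one group.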

Regex match everything up to first period

Trying to get a lazy regex match of everything up until the first period of a sentence.
e.g. Just want to get "jack and jill." from this sentence:
"jack and jill. went up the hill. to fetch a pail."
/.+\./ matches the whole sentence (example)
/(.+?\.)/ matches each instance (example)
Is there a way to just match the first instance?
/^([^.]+)/
Let's break it down,
^ anchors the match at the start of a line
[^.] matches any character that's not a period
+ repeats that one or more times, so the match runs up to (but does not include) the first period
And the expression is encapsulated with () to capture it.
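Note that this pattern stops just before the period, while the question asks for the period to be included. A quick sketch of both variants follows; the second pattern, with an explicit \. after the character class, is an adaptation and not part of the answer above.

```ruby
SENTENCE = "jack and jill. went up the hill. to fetch a pail."

# The answer's pattern: everything before the first period.
WITHOUT_PERIOD = SENTENCE[/^([^.]+)/, 1]

# Adapted pattern: the same run of non-period characters plus the period itself.
WITH_PERIOD = SENTENCE[/^[^.]*\./]

p WITHOUT_PERIOD  # "jack and jill"
p WITH_PERIOD     # "jack and jill."
```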
If you only want the first occurrence, do not choose the scan method that returns all results in the string. You can use the match method that returns a MatchData instance, but you can also simply write:
> "jack and jill. went up the hill. to fetch a pail."[/.+?\./]
=> "jack and jill."
I would be inclined to use a regex, but there are other options.
str = "jack and jill. went up the hill. supposedly to fetch a pail of water."
i = str.index('.') # assign first; "str[0..i] if i = ..." raises NameError when i is not yet defined
str[0..i] if i
#=> "jack and jill."
str = "three blind mice"
i = str.index('.')
str[0..i] if i
#=> nil
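The options mentioned in these answers (scan, match, String#[] with a regex, and index) can be compared side by side. This is a minimal sketch; the constant names are made up for the comparison.

```ruby
TEXT = "jack and jill. went up the hill. to fetch a pail."

# scan returns every lazy match, not just the first one.
ALL_SENTENCES = TEXT.scan(/.+?\./)

# match stops at the first match and returns a MatchData object.
FIRST_VIA_MATCH = TEXT.match(/.+?\./)[0]

# String#[] with a regex returns the first matching substring directly.
FIRST_VIA_BRACKETS = TEXT[/.+?\./]

# No regex at all: slice up to the index of the first period, if any.
PERIOD_INDEX = TEXT.index('.')
FIRST_VIA_INDEX = PERIOD_INDEX ? TEXT[0..PERIOD_INDEX] : nil

p ALL_SENTENCES
p FIRST_VIA_MATCH
p FIRST_VIA_BRACKETS
p FIRST_VIA_INDEX
```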

Why does the output from my map/regex block not capitalize?

I'm working through the Test First Ruby Master problems. My code for 08/book_titles is this:
class Book
  attr_accessor :title
  def title
    if @title.include?(' ')
      correct = @title.split.each_with_index.map {|x, index| ((x =~ /^a|an|of|the|or|in|and$/) && index != 0) ? x : x.capitalize}
      correct.join(' ')
      # this is throwing a weird error, the code looks right but isn't capitalizing last word (returns 'To Kill a mockingbird')
    else @title.capitalize
    end
  end
end
I tested the map portion separately, and it works fine. But in the entirety of the problem, it does not capitalize as it should be. It throws an rspec error:
1) Book title should capitalize every word except... articles a
Failure/Error: expect(@book.title).to eq("To Kill a Mockingbird")
expected: "To Kill a Mockingbird"
got: "To Kill a mockingbird"
Anyone know why?
I originally didn't include ^/$ in the regex. I got the same error with a different title, and adding those anchors fixed it for that case. But then the error showed up again with the title.
Because mockingbird contains in
('mockingbird' =~ /^a|an|of|the|or|in|and$/) => 4
I think you want this regex:
/^a$|^an$|^of$|^the$|^or$|^in$|^and$/
It is not necessary to break the string into words, modify the words and join them back into a string. In fact, doing that has the disadvantage that spacing between words may be altered. Here's one way of operating on the string directly.
wee_words = ["a", "an", "of", "the", "or", "in", "and"]
str = "a dAy in the life of waltEr mITTY"
str.capitalize.gsub(/\w+/) { |s| wee_words.include?(s) ? s : s.capitalize }
#=> "A Day in the Life of Walter Mitty"
str.capitalize upcases the first letter of the string and downcases all subsequent letters. As a result, the first word will never be treated as a wee_word, since it is capitalized (e.g., wee_words.include?("The") #=> false).
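The spacing point can be seen with a string that has uneven gaps between words. This is a small sketch; the method name title_case_in_place and the sample string are illustrative only.

```ruby
WEE_WORDS = %w[a an of the or in and].freeze

# Capitalize words directly in the string, leaving whitespace untouched.
def title_case_in_place(str)
  str.capitalize.gsub(/\w+/) { |word| WEE_WORDS.include?(word) ? word : word.capitalize }
end

SPACED = "a tale  of   two cities"

p title_case_in_place(SPACED)  # gaps between words are preserved
p SPACED.split.join(' ')       # split/join collapses every gap to one space
```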
The regex is slightly incorrect. As written, it reads like this:
Match any string that
starts with 'a'
or contains 'an'
or contains 'of'
or contains 'the'
or contains 'or'
or contains 'in'
or ends in 'and'
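That reading can be checked quickly by comparing the original pattern with a grouped one on a few words. The constant names and the sample words are made up for this sketch.

```ruby
# Original pattern: ^ binds only to "a" and $ binds only to "and".
UNGROUPED = /^a|an|of|the|or|in|and$/

# Grouped pattern: the anchors apply to every alternative.
GROUPED = /^(a|an|of|the|or|in|and)$/

%w[mockingbird brother apple in the].each do |word|
  puts "#{word}: ungrouped=#{word.match?(UNGROUPED)} grouped=#{word.match?(GROUPED)}"
end
```

Only "in" and "the" satisfy the grouped pattern, while every sample word satisfies the ungrouped one.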
What you really seem to want is something that reads like this:
Match any string that
only contains any of 'a', 'an', 'of', 'the', 'or', 'in', 'and'
To get this, you want your regex to be written like this:
/^(a|an|of|the|or|in|and)$/
Note the parentheses around the alternation. (Alternation is the formal term for multiple choices in a regex, where choices are separated by '|').
If you're comparing against book or movie titles, this is much closer to the type of match you'd expect. It will match correctly for titles such as "Chariots of Fire" and "Benny and Joon", and it no longer falsely matches the 'in' inside "Mockingbird" in "To Kill a Mockingbird", which is a significant improvement.
However, it still won't quite work yet on something like "Benny AND Joon", because 'AND' is uppercase in this title (assuming that incoming titles may be arbitrarily mixed case). One last change will do it:
/^(a|an|of|the|or|in|and)$/i
That last letter 'i' at the end of the regex says to 'ignore case', so that matches can occur regardless of whether the 'AND' is uppercase, lowercase, or mixed case.
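Putting the pieces together, a title-casing method built on this regex might look like the sketch below. The method name title_case is hypothetical, and downcasing the small words is an added assumption so that "AND" comes out as "and".

```ruby
SMALL_WORDS = /^(a|an|of|the|or|in|and)$/i

# Capitalize every word except small words that are not in first position.
def title_case(title)
  words = title.split.each_with_index.map do |word, index|
    if index.positive? && word.match?(SMALL_WORDS)
      word.downcase
    else
      word.capitalize
    end
  end
  words.join(' ')
end

p title_case("to kill a mockingbird")
p title_case("benny AND joon")
p title_case("the lord of the rings")
```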
This should get you close to what you're trying to achieve and handle a few bumpy use cases in the process.

Why is this Regex result unexpected

The regex in question is
/(<iframe.*?><\/iframe>)/
I am using this ruby regex to match sections of a string then creating an array of the results.
The string is
"<p><iframe src=\"http://www.dailymotion.com/embed/video/k18WBkRTMldXzB7JYW5?logo=0&info=0\" frameborder=\"0\" height=\"450\" width=\"580\"></iframe></p>\n<p>#1<br />\n<iframe src=\"https://www.cloudy.ec/embed.php?id=cabe5d3ba31da\" allowfullscreen=\"\" frameborder=\"0\" height=\"420\" width=\"640\"></iframe></p>\n<p>#2<br />\n<iframe src=\"https://www.cloudy.ec/embed.php?id=b03d31e4b5663\" allowfullscreen=\"\" frameborder=\"0\" height=\"420\" width=\"640\"></iframe></p>\n<p>#3<br />\n<iframe src=\"https://www.cloudy.ec/embed.php?id=f63895add1aac\" allowfullscreen=\"\" frameborder=\"0\" height=\"420\" width=\"640\"></iframe></p>\n"
I am calling .match() on the regex like so
/(<iframe.*?><\/iframe>)/.match(entry.content).to_a
The result is a duplicate of the first match
["<iframe src=\"http://www.dailymotion.com/embed/video/k18WBkRTMldXzB7JYW5?logo=0&info=0\" frameborder=\"0\" height=\"450\" width=\"580\"></iframe>", "<iframe src=\"http://www.dailymotion.com/embed/video/k18WBkRTMldXzB7JYW5?logo=0&info=0\" frameborder=\"0\" height=\"450\" width=\"580\"></iframe>"]
I used Rubular and I was able to get the Regex to work there http://rubular.com/r/CYF0vgQtrX
The result is a duplicate of the first match
Even though the docs for Regexp#match() do a horrible job of describing what match() does, it actually finds only the first match:
str = "abc"
md = /./.match(str)
p md.to_a
--output:--
["a"]
Regexp#match() returns a MatchData object when there is a match. A MatchData object contains matches for the whole match and for each group. If you call to_a() on a MatchData object, the return value is an Array containing the whole match and whatever matched each group in the regex:
str = "abc"
md = /(.)(.)(.)/.match(str)
p md.to_a
--output:--
["abc", "a", "b", "c"]
Because you specified a group in your regex, one result is the whole match, and the other result is what matched your group.
[A regex] was the first approach I thought of. If this wasn't going to
work, then I was going to use nokogiri
From now on, nokogiri should be your first thought...because:
If you have a programming problem, and you think, "I'll use a regex",
now you have two problems.
You should use scan instead of match here.
entry.content.scan(/<iframe.*?><\/iframe>/)
Using /(<iframe.*?><\/iframe>)/ with scan will get a 2D array. The documentation for String#scan says:
If the pattern contains groups, each individual result is itself an array containing one entry per group.
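The three behaviours can be seen together on a shortened input. This is a sketch only: the HTML string is a made-up, trimmed-down stand-in for the content in the question.

```ruby
# Hypothetical, shortened version of the content from the question.
HTML = '<p><iframe src="a"></iframe></p><p><iframe src="b"></iframe></p>'

# match: first match only; to_a gives [whole match, group 1].
MATCHED = /(<iframe.*?><\/iframe>)/.match(HTML).to_a

# scan without a group: a flat array with every match.
SCANNED_FLAT = HTML.scan(/<iframe.*?><\/iframe>/)

# scan with a group: one inner array per match.
SCANNED_GROUPED = HTML.scan(/(<iframe.*?><\/iframe>)/)

p MATCHED
p SCANNED_FLAT
p SCANNED_GROUPED
```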

ruby/regex getting the first letter of each word

I want to get the first letter of each word put together, making something like "I need help" turn into "Inh". I was thinking to trim everything off, then going from there, or grab each first letter right away.
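You could simply use split, map and join together here. (String#first is provided by ActiveSupport, so this works as written in Rails; in plain Ruby, use chr instead.)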
string = 'I need help'
result = string.split.map(&:first).join
puts result #=> "Inh"
How about regular expressions? Using split forces you to deal with parts of the string that you don't need for this problem, and then takes another step to extract the first letter of each word (chr). That's why I think a regular expression is better for this case. Note that this will also work if you have a - or another special character in the string. And, of course, you can add the .upcase method at the end to get a proper acronym.
string = 'something - something and something else'
string.scan(/\b\w/).join
#=> ssase
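To see where the two approaches differ, here is a small comparison in plain Ruby. The method names initials_by_scan and initials_by_split are made up for this sketch.

```ruby
# Regex approach: the first word character after each word boundary.
def initials_by_scan(text)
  text.scan(/\b\w/).join
end

# Split approach: the first character of each whitespace-separated chunk.
def initials_by_split(text)
  text.split.map(&:chr).join
end

SAMPLE = 'something - something and something else'

p initials_by_scan('I need help')   # "Inh"
p initials_by_split('I need help')  # "Inh"
p initials_by_scan(SAMPLE)          # "ssase"
p initials_by_split(SAMPLE)         # "s-sase", the lone hyphen counts as a chunk
p initials_by_scan(SAMPLE).upcase   # "SSASE"
```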
Alternative solution using regex
string = 'I need help'
result = string.scan(/(\A\w|(?<=\s)\w)/).flatten.join
puts result
This basically says "look for either the first letter or any letter directly preceded by a space". Because the pattern contains a group, the scan method returns an array of arrays of matches, which is flattened (made into one array) and joined (made into a string).
string = 'I need help'
result = string.split.map(&:chr).join
puts result
http://ruby-doc.org/core-2.0/String.html#method-i-chr
