Why is this Regex result unexpected

Why is this Regex result unexpected - ruby

The regex in question is
/(<iframe.*?><\/iframe>)/
I am using this ruby regex to match sections of a string then creating an array of the results.
The string is
"<p><iframe src=\"http://www.dailymotion.com/embed/video/k18WBkRTMldXzB7JYW5?logo=0&info=0\" frameborder=\"0\" height=\"450\" width=\"580\"></iframe></p>\n<p>#1<br />\n<iframe src=\"https://www.cloudy.ec/embed.php?id=cabe5d3ba31da\" allowfullscreen=\"\" frameborder=\"0\" height=\"420\" width=\"640\"></iframe></p>\n<p>#2<br />\n<iframe src=\"https://www.cloudy.ec/embed.php?id=b03d31e4b5663\" allowfullscreen=\"\" frameborder=\"0\" height=\"420\" width=\"640\"></iframe></p>\n<p>#3<br />\n<iframe src=\"https://www.cloudy.ec/embed.php?id=f63895add1aac\" allowfullscreen=\"\" frameborder=\"0\" height=\"420\" width=\"640\"></iframe></p>\n"
I am calling the regex is .match() like so
/(<iframe.*?><\/iframe>)/.match(entry.content).to_a
The result is a duplicate of the first match
["<iframe src=\"http://www.dailymotion.com/embed/video/k18WBkRTMldXzB7JYW5?logo=0&info=0\" frameborder=\"0\" height=\"450\" width=\"580\"></iframe>", "<iframe src=\"http://www.dailymotion.com/embed/video/k18WBkRTMldXzB7JYW5?logo=0&info=0\" frameborder=\"0\" height=\"450\" width=\"580\"></iframe>"]
I used Rubular and I was able to get the Regex to work there http://rubular.com/r/CYF0vgQtrX

The result is a duplicate of the first match
Even though the docs for Regex#match() do a horrible job of describing what match() does, it actually finds the first match:
str = "abc"
md = /./.match(str)
p md.to_a
--output:--
["a"]
Regexp.match() returns a MatchData object when there is a match. A MatchData object contains matches for the whole match and for each group. If you call to_a() on a MatchData object, the return value is an Array containing the whole match and whatever matched each group in the regex:
str = "abc"
md = /(.)(.)(.)/.match(str)
p md.to_a
--output:--
["abc", "a", "b", "c"]
Because you specified a group in your regex, one result is the whole match, and the other result is what matched your group.
[A regex] was the first approach I thought of. If this wasn't going to
work, then I was going to use nokogiri
From now on, nokogiri should be your first thought...because:
If you have a programming problem, and you think, "I'll use a regex",
now you have two problems".

You should use scan instead of match here.
entry.content.scan(/<iframe.*?><\/iframe>/)
Using /(<iframe.*?><\/iframe>)/ will get a 2d array. The document says:
If the pattern contains groups, each individual result is itself an array containing one entry per group.

Related

Regular expression returns only one match

I have a set of keywords. Any keyword can contain a space symbol ['one', 'one two']. I generate a regexp from these kyewords like this /\b(?i:one|one\ two|three)\b/. Full example below:
keywords = ['one', 'one two', 'three']
re = /\b(?i:#{ Regexp.union(keywords).source })\b/
text = 'Some word one and one two other word'
text.downcase.scan(re)
the result of this code is
=> ["one", "one"]
How to find match of the second keyword one two and get result like this?
=> ["one", "one two"]

Regexes are eager to match. Once they find a match, they don't try to find another possibly longer one (with one important exception).
/\b(?i:one|one\ two|three)\b/ is never going to match one two because it will always match one first. You'd need /\b(?i:one two|one|three)\b/ so it tries one two first. Probably the simplest way to automate this is to sort by the longest keywords first.
keywords = ['one', 'one two', 'three']
re = Regexp.union(keywords.sort { |a,b| b.length <=> a.length }).source
re = /\b#{re}\b/i;
text = 'Some word one and one two other word'
puts text.scan(re)
Note that I set the whole regex to be case-insensitive, easier to read than (?:...), and that downcasing the string is redundant.
The exception is repetition like +, * and friends. They are greedy by default. .+ is going to match as many characters as it can. That's greedy. You can make it lazy, to match the first thing it sees, with a ?. .+? will match a single character.
"A foot of fools".match(/(.*foo)/); # matches "A foot of foo"
"A foot of fools".match(/(.*?foo)/); # matches "A foo"

The point is that \bone\b matches one in one two and since this branch appears before one two branch, it "wins" (see Remember That The Regex Engine Is Eager).
You need to sort the keyword array in a descending order before building a regex. It will then look like
(?-mix:\b(?i:three|one\ two|one)\b)
This way the longer one two will be before the shorter one and will get matched.
See the Ruby demo:
keywords = ['one', 'one two', 'three']
keywords = keywords.dup.sort.reverse
re = /\b(?i:#{ Regexp.union(keywords).source })\b/
text = 'Some word one and one two other word'
puts text.downcase.scan(re)
# => [ one, one two ]

I tried your example by moving the first element to the second position of the array and it works (e.g. http://rubular.com/r/4F2Hc46wHT).
In fact, it looks like the first keyword "overlaps" the second.
This response may be unhelpful if you can't change keywords order.

opposite of sub in ruby

I want to replace the content (or delete it) that does not match with my filter.
I think the perfect description would be an opposite sub. I cannot find anything similar in the docs, and I'm not sure how to invert the regex, but I think a method would probably be the more convenient.
An example of how it would work (I've just changed the words to make it more clear)
"bird.cats.dogs".opposite_sub(/(dogs|cats)\.(dogs|cats)/, '')
#"cats.dogs"
I hope it's easy enough to understand.
Thanks in advance.

String#[] can take a regular expression as its parameter:
▶ "bird.cats.dogs"[/(dogs|cats)\.(dogs|cats)/]
#⇒ "cats.dogs"
For multiple matches one can use String#scan:
▶ "bird.cats.dogs.bird.cats.dogs".scan /(?:dogs|cats)\.(?:dogs|cats)/
#⇒ ["cats.dogs", "cats.dogs"]

So you want to extract the part that matches your regex?
You can use String#slice, for example:
"bird.cats.dogs".slice(/(dogs|cats)\.(dogs|cats)/)
#=> "cats.dogs"
And String#[] does the same.
"bird.cats.dogs"[/(dogs|cats)\.(dogs|cats)/]
#=> "cats.dogs"

You cannot have a single replacement string because the part of the string that matches the regex might not be at the beginning or end of the string, in which case it's not clear whether the replacement string should precede or follow the matching string. I've therefore written the following with two replacement strings, one for pre-match, the other for post_match. I've made this a method of the String class as that's what you've asked for (though I've given the method a less-perfect name :-) )
class String
def replace_non_matching(regex, replace_before, replace_after)
first, match, last = partition(regex)
replace_before + match + replace_after
end
end
r = /(dogs|cats)\.(dogs|cats)/
"birds.cats.dogs.pigs".replace_non_matching(r, "", "")
#=> "cats.dogs"
"birds.cats.dogs".replace_non_matching(r, "snakes.", ".hens")
#=> "snakes.cats.dogs.hens"
"birds.cats.dogs.mice.cats.dogs.bats".replace_non_matching(r, "snakes.", ".hens")
#=> "snakes.cats.dogs.hens"
Regarding the last example, the method could be modified to replace "birds.", ".mice." and ".bats", but in that case three replacement strings would be needed. In general, determining in advance the number of replacement strings needed could be problematic.

Use regular expression to fetch 3 groups from string

This is my expected result.
Input a string and get three returned string.
I have no idea how to finish it with Regex in Ruby.
this is my roughly idea.
match(/(.*?)(_)(.*?)(\d+)/)
Input and expected output
# "R224_OO2003" => R224, OO, 2003
# "R2241_OOP2003" => R2244, OOP, 2003

If the example description I gave in my comment on the question is correct, you need a very straightforward regex:
r = /(.+)_(.+)(\d{4})/
Then:
"R224_OO2003".scan(r).flatten #=> ["R224", "OO", "2003"]
"R2241_OOP2003".scan(r).flatten #=> ["R2241", "OOP", "2003"]

Assuming that your three parts consist of (R and one or more digits), then an underbar, then (one or more non-whitespace characters), before finally (a 4-digit numeric date), then your regex could be something like this:
^(R\d+)_(\S+)(\d{4})$
The ^ indicates start of string, and the $ indicates end of string. \d+ indicates one or more digits, while \S+ says one or more non-whitespace characters. The \d{4} says exactly four digits.
To recover data from the matches, you could either use the pre-defined globals that line up with your groups, or you could could use named captures.
To use the match globals just use $1, $2, and $3. In general, you can figure out the number to use by counting the left parentheses of the specific group.
To use the named captures, include ? right after the left paren of a particular group. For example:
x = "R2241_OOP2003"
match_data = /^(?<first>R\d+)_(?<second>\S+)(?<third>\d{4})$/.match(x)
puts match_data['first'], match_data['second'], match_data['third']
yields
R2241
OOP
2003
as expected.

As long as your pattern covers all possibilities, then you just need to use the match object to return the 3 strings:
my_match = "R224_OO2003".match(/(.*?)(_)(.*?)(\d+)/)
#=> #<MatchData "R224_OO2003" 1:"R224" 2:"_" 3:"OO" 4:"2003">
puts my_match[0] #=> "R224_OO2003"
puts my_match[1] #=> "R224"
puts my_match[2] #=> "_"
puts my_match[3] #=> "00"
puts my_match[4] #=> "2003"
A MatchData object contains an array of each match group starting at index [1]. As you can see, index [0] returns the entire string. If you don't want the capture the "_" you can leave it's parentheses out.
Also, I'm not sure you are getting what you want with the part:
(.*?)
this basically says one or more of any single character followed by zero or one of any single character.

Two strings evaluated by regex, but one of the scan results are being put into an extra array?

I can't figure out what I'm doing different in the below example. I have two string which in my perspective are similar - plain strings. For each string I have a regex, but the first regex, /\*Hi (.*) \*,/, gives me a result where the regex match is presented in 2 arrays: [["result"]]. I need my result to be presented in just 1 array: ["result"]. What am I doing differently in the 2 below examples?
✗ irb
2.0.0p247 :001 > name_line_1 = "*Hi Peter Parker *,"
=> "*Hi Peter Parker *,"
2.0.0p247 :002 > name_line_1.scan(/\*Hi (.*) \*,/)
=> [["Peter Parker"]]
2.0.0p247 :003 > name_line_2 = "Peter Parker<br />Memory Lane 60<br />0000 Gotham<br />USA<br />TEL:: 00000000000<br />peter#parker.com<br />\r"
=> "Peter Parker<br />Memory Lane 60<br />0000 Gotham<br />USA<br />TEL:: 00000000000<br />peter#parker.com<br />\r"
2.0.0p247 :004 > name_line_2.scan(/^[^<]*/)
=> ["Peter Parker"]

scan returns an array of matches. As the other answers point out, if your regex has capturing groups (parentheses), that means each match will return an array, with one string for each capturing group within the match.
If it didn't do this, scan wouldn't be very useful, as it is very common to use capturing groups in a regex to pick out different parts of the match.
I suspect that scan is not really the best method for your situation. scan is useful when you want to get all the matches from a string. But in the string you show, there is only one match anyways. If you want to get a specific capturing group from the first match in a string, the easiest way is:
string[/regex/, 1] # extract the first capturing group, or nil if there is no match
Another way is to do something like this:
if string =~ /regex/
# $1 will contain the first capturing group from the first match
Or:
if match = string.match(/regex/)
# match[1] will contain the first capturing group
If you really want to get all matches in the string, and need to use a capturing group (or feel it's more readable than using lookahead and lookbehind, which it is):
string.scan(/regex/) do |match|
# do something with match[0]
end
Or:
string.scan(/regex/).map(&:first)

Its because you are capturing the name in name_line_1 using parentheses. This causes the scan method to return an array of arrays. If you absolutely must return a 1 dimensional array, you can use forward and backward checking like so:
/(?<=\*Hi ).*(?= \*,)/
Or, if you find that too confusing, you could always just call .flatten on the resulting array ;-)

The difference is that, in the first regex, you have captured substring (). When a regex matches, the whole match is captured as $&, and in addition to that, you can capture parts of it as many as you want by using (). They will be captured as $1, $2, ...
And scan behaves differently depending whether you have $1, $2, ... When you don't, then it returns an array of all $&s. When you do have $1, $2, ..., then it returns an array of [$1, $2, ...].
In order to avoid $1 in the first regex, you have to avoid using captured substring:

What's the splat doing here?

match, text, number = *"foobar 123".match(/([A-z]*) ([0-9]*)/)
I know this is doing some kind of regular expression match but what role does the splat play here and is there a way to do this without the splat so it's less confusing?

The splat is decomposing the regex match results (a MatchData with three groups: the whole pattern, the letters, and the numbers) into three variables. So we end up with:
match = "foobar 123"
text = "foobar"
number = "123"
Without the splat, there'd only be the one result (the MatchData) so Ruby wouldn't know how to assign it to the three separate variables.

is there a way to do this without the splat so it's less confusing?
Since a,b = [c,d] is the same as a,b = *[c,d] and splat calls to_a on its operand when it's not an array you could simply call to_a explicitly and not need the splat:
match, text, number = "foobar 123".match(/([A-z]*) ([0-9]*)/).to_a
Don't know whether that's less confusing, but it's splatless.

There's a good explanation in the documentation for MatchData:
Because to_a is called when expanding
*variable, there‘s a useful assignment shortcut for extracting matched
fields. This is slightly slower than
accessing the fields directly (as an
intermediate array is generated).
all,f1,f2,f3 = *(/(.)(.)(\d+)(\d)/.match("THX1138."))
all #=> "HX1138"
f1 #=> "H"
f2 #=> "X"
f3 #=> "113"

String.match returns a MatchData object, which contains all the matches of the regular expression. The splat operator splits this object and returns all the matches separately.
If you just run
"foobar 123".match(/([A-z]*) ([0-9]*)/)
in irb, you can see the MatchData object, with the matches collected.

MatchData is a special variable, for all intents and purposes an array (kind of) so you can in fact do this as well:
match, text, number = "foobar 123".match(/([A-z]*) ([0-9]*)/)[0..2]
Learn more about the special variable MatchData

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

Why is this Regex result unexpected - ruby

You should use scan instead of match here. entry.content.scan(/<iframe.?><\/iframe>/) Using /(<iframe.?><\/iframe>)/ will get a 2d array. The document says: If the pattern contains groups, each individual result is itself an array containing one entry per group.

Related

Regular expression returns only one match

opposite of sub in ruby

Use regular expression to fetch 3 groups from string

Two strings evaluated by regex, but one of the scan results are being put into an extra array?

What's the splat doing here?

Categories

Resources

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

Why is this Regex result unexpected - ruby

You should use scan instead of match here. entry.content.scan(/<iframe.*?><\/iframe>/) Using /(<iframe.*?><\/iframe>)/ will get a 2d array. The document says: If the pattern contains groups, each individual result is itself an array containing one entry per group.

Related

Regular expression returns only one match

opposite of sub in ruby

Use regular expression to fetch 3 groups from string

Two strings evaluated by regex, but one of the scan results are being put into an extra array?

What's the splat doing here?

Categories

Resources

You should use scan instead of match here. entry.content.scan(/<iframe.?><\/iframe>/) Using /(<iframe.?><\/iframe>)/ will get a 2d array. The document says: If the pattern contains groups, each individual result is itself an array containing one entry per group.