ruby match and scan not matching a pattern the same way? - ruby

I'm trying to parse out some information from multiple records. One of the items I'm interested in can have multiple entries in a string. My thought was just to return an array of all the matching values, but I'm having trouble with the results. For example:
> s = '>ctg7180000000043_1204 selected_feature: CDS loc=299156..299605;/db_xref="GO:0007155";/db_xref="GO:0009289";'
=> ">ctg7180000000043_1204 selected_feature: CDS loc=299156..299605;/db_xref=\"GO:0007155\";/db_xref=\"GO:0009289\";"
> s.match('db_xref="[^"]+')
=> #<MatchData "db_xref=\"GO:0007155">
> s.scan('db_xref="[^"]+')
=> []
Anyway, why does match, er, match while scan does not?

String#match converts its argument to a Regexp, whereas String#scan searches for a literal string if that's what you give it. Giving #scan a Regexp produces the same matches. See the ri docs for String#match and String#scan. Try the following in irb:
regex = /db_xref="[^"]+/
s.match(regex)
=> #<MatchData "db_xref=\"GO:0007155">
s.scan(regex)
=> ["db_xref=\"GO:0007155", "db_xref=\"GO:0009289"]
scan will also continue to match over the entire string, while match stops at the first match (you can give it a start offset to continue from if you need to).
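As a minimal sketch of the difference, the snippet below reuses the string from the question and passes the same String pattern to both methods. The variable names (pattern, literal_hits, regex_hits, first, second) are illustrative only. It also shows the start-offset form of match mentioned above.

```ruby
s = '>ctg7180000000043_1204 selected_feature: CDS loc=299156..299605;/db_xref="GO:0007155";/db_xref="GO:0009289";'
pattern = 'db_xref="[^"]+'

# scan treats a String argument as literal text, so nothing is found.
literal_hits = s.scan(pattern)

# Converting the String to a Regexp makes scan find every occurrence.
regex_hits = s.scan(Regexp.new(pattern))

# match converts the String to a Regexp itself and stops at the first hit.
first = s.match(pattern)

# The optional second argument is the offset at which to start searching;
# here the search resumes where the first match ended.
second = s.match(pattern, first.end(0))

p literal_hits
p regex_hits
p first[0]
p second[0]
```

Walking the string with repeated match calls works, but scan with a Regexp is usually the shorter way to collect every occurrence.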

Related

Deleting a substring from a string in Ruby

For destructively deleting substrings from a string by match with a regex or a string (not a character range used for tr), there is one way to do it:
string.gsub!(regex_or_string_pattern, "")
string # => ...
I thought this can be replaced by the following code:
string.slice!(regex_or_string_pattern)
string # => ...
However, testing them with some examples seems to indicate that they are not equivalent. When do they end up with different results?
Because gsub! is "Global Substitution". If there is more than one match for your string_or_regex_pattern, gsub! will replace all of them with "". However, slice! will only slice out the first match.
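A small sketch of that difference, using a made-up sample string with two occurrences of the pattern. The names a, b, gsub_result and slice_result are illustrative only.

```ruby
# .dup gives mutable copies so the destructive methods can modify them.
a = "foo-bar-foo".dup
b = a.dup

# gsub! removes every occurrence and returns the modified receiver
# (or nil when nothing was substituted).
gsub_result = a.gsub!("foo", "")

# slice! removes only the first occurrence and returns the removed part
# (or nil when nothing matched).
slice_result = b.slice!("foo")

p a
p b
p slice_result
```

The return values differ as well: gsub! hands back the receiver, while slice! hands back the piece it removed, so the two are not interchangeable in a method chain either.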

Why is this Regex result unexpected

The regex in question is
/(<iframe.*?><\/iframe>)/
I am using this ruby regex to match sections of a string then creating an array of the results.
The string is
"<p><iframe src=\"http://www.dailymotion.com/embed/video/k18WBkRTMldXzB7JYW5?logo=0&info=0\" frameborder=\"0\" height=\"450\" width=\"580\"></iframe></p>\n<p>#1<br />\n<iframe src=\"https://www.cloudy.ec/embed.php?id=cabe5d3ba31da\" allowfullscreen=\"\" frameborder=\"0\" height=\"420\" width=\"640\"></iframe></p>\n<p>#2<br />\n<iframe src=\"https://www.cloudy.ec/embed.php?id=b03d31e4b5663\" allowfullscreen=\"\" frameborder=\"0\" height=\"420\" width=\"640\"></iframe></p>\n<p>#3<br />\n<iframe src=\"https://www.cloudy.ec/embed.php?id=f63895add1aac\" allowfullscreen=\"\" frameborder=\"0\" height=\"420\" width=\"640\"></iframe></p>\n"
I am calling the regex with .match() like so
/(<iframe.*?><\/iframe>)/.match(entry.content).to_a
The result is a duplicate of the first match
["<iframe src=\"http://www.dailymotion.com/embed/video/k18WBkRTMldXzB7JYW5?logo=0&info=0\" frameborder=\"0\" height=\"450\" width=\"580\"></iframe>", "<iframe src=\"http://www.dailymotion.com/embed/video/k18WBkRTMldXzB7JYW5?logo=0&info=0\" frameborder=\"0\" height=\"450\" width=\"580\"></iframe>"]
I used Rubular and I was able to get the Regex to work there http://rubular.com/r/CYF0vgQtrX
The result is a duplicate of the first match
Even though the docs for Regexp#match() do a horrible job of describing what match() does, it actually finds only the first match:
str = "abc"
md = /./.match(str)
p md.to_a
--output:--
["a"]
Regexp#match() returns a MatchData object when there is a match. A MatchData object contains the whole match and the text matched by each group. If you call to_a() on a MatchData object, the return value is an Array containing the whole match followed by whatever matched each group in the regex:
str = "abc"
md = /(.)(.)(.)/.match(str)
p md.to_a
--output:--
["abc", "a", "b", "c"]
Because you specified a group in your regex, one result is the whole match, and the other result is what matched your group.
[A regex] was the first approach I thought of. If this wasn't going to
work, then I was going to use nokogiri
From now on, nokogiri should be your first thought...because:
If you have a programming problem, and you think, "I'll use a regex",
now you have two problems.
You should use scan instead of match here.
entry.content.scan(/<iframe.*?><\/iframe>/)
Using /(<iframe.*?><\/iframe>)/ with scan will return a 2D array. The documentation says:
If the pattern contains groups, each individual result is itself an array containing one entry per group.
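To make the three behaviours concrete, here is a sketch on a shortened, made-up HTML string (the html value and its src attributes are placeholders, not the content from the question).

```ruby
# Hypothetical, shortened input with two iframes.
html = '<p><iframe src="a"></iframe></p><p><iframe src="b"></iframe></p>'

# match finds only the first iframe; to_a gives the whole match plus group 1,
# which is why the same text appears twice.
md = /(<iframe.*?><\/iframe>)/.match(html)
match_array = md.to_a

# scan without a group returns a flat array of every match.
flat = html.scan(/<iframe.*?><\/iframe>/)

# scan with a group returns one inner array per match.
nested = html.scan(/(<iframe.*?><\/iframe>)/)

p match_array
p flat
p nested
```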

Two strings evaluated by regex, but one of the scan results are being put into an extra array?

I can't figure out what I'm doing differently in the example below. I have two strings which, from my perspective, are similar: plain strings. For each string I have a regex, but the first regex, /\*Hi (.*) \*,/, gives me a result where the regex match is wrapped in 2 arrays: [["result"]]. I need my result to be in just 1 array: ["result"]. What am I doing differently in the 2 examples below?
✗ irb
2.0.0p247 :001 > name_line_1 = "*Hi Peter Parker *,"
=> "*Hi Peter Parker *,"
2.0.0p247 :002 > name_line_1.scan(/\*Hi (.*) \*,/)
=> [["Peter Parker"]]
2.0.0p247 :003 > name_line_2 = "Peter Parker<br />Memory Lane 60<br />0000 Gotham<br />USA<br />TEL:: 00000000000<br />peter#parker.com<br />\r"
=> "Peter Parker<br />Memory Lane 60<br />0000 Gotham<br />USA<br />TEL:: 00000000000<br />peter#parker.com<br />\r"
2.0.0p247 :004 > name_line_2.scan(/^[^<]*/)
=> ["Peter Parker"]
scan returns an array of matches. As the other answers point out, if your regex has capturing groups (parentheses), that means each match will return an array, with one string for each capturing group within the match.
If it didn't do this, scan wouldn't be very useful, as it is very common to use capturing groups in a regex to pick out different parts of the match.
I suspect that scan is not really the best method for your situation. scan is useful when you want to get all the matches from a string. But in the string you show, there is only one match anyways. If you want to get a specific capturing group from the first match in a string, the easiest way is:
string[/regex/, 1] # extract the first capturing group, or nil if there is no match
Another way is to do something like this:
if string =~ /regex/
  # $1 will contain the first capturing group from the first match
end
Or:
if match = string.match(/regex/)
  # match[1] will contain the first capturing group
end
If you really want to get all matches in the string, and need to use a capturing group (or feel it's more readable than using lookahead and lookbehind, which it is):
string.scan(/regex/) do |match|
# do something with match[0]
end
Or:
string.scan(/regex/).map(&:first)
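The options above can be sketched side by side on the string from the question. The variable names (via_index, via_global, via_match, via_scan) are illustrative only.

```ruby
string = "*Hi Peter Parker *,"
regex = /\*Hi (.*) \*,/

# String#[] with a group index: first capture of the first match, or nil.
via_index = string[regex, 1]

# =~ sets $1 as a side effect when the match succeeds.
via_global = (string =~ regex) ? $1 : nil

# match returns a MatchData (or nil); index 1 is the first capture.
md = string.match(regex)
via_match = md && md[1]

# scan returns one inner array per match; map(&:first) flattens
# a single-group result into a flat array.
via_scan = string.scan(regex).map(&:first)

p via_index
p via_global
p via_match
p via_scan
```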
It's because you are capturing the name in name_line_1 using parentheses. This causes the scan method to return an array of arrays. If you absolutely must return a 1-dimensional array, you can use lookbehind and lookahead like so:
/(?<=\*Hi ).*(?= \*,)/
Or, if you find that too confusing, you could always just call .flatten on the resulting array ;-)
The difference is that, in the first regex, you have a capture group (). When a regex matches, the whole match is captured as $&, and in addition to that, you can capture as many parts of it as you want by using (). They will be captured as $1, $2, ...
And scan behaves differently depending on whether you have $1, $2, ... When you don't, it returns an array of all the $&s. When you do have $1, $2, ..., it returns an array of [$1, $2, ...].
To avoid $1 in the first regex, you have to avoid using a capture group, for example with lookbehind and lookahead as in the pattern above.
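A sketch of how the shape of the scan result changes with and without a capture group, using the string from the question. The variable names are illustrative only.

```ruby
name_line_1 = "*Hi Peter Parker *,"

# With a capture group, scan returns an array of [$1, ...] arrays.
with_group = name_line_1.scan(/\*Hi (.*) \*,/)

# Without a group, scan returns the whole matches ($&), delimiters included.
no_group = name_line_1.scan(/\*Hi .* \*,/)

# Lookbehind and lookahead keep the delimiters out of the match itself,
# so no group is needed and the result is flat.
lookaround = name_line_1.scan(/(?<=\*Hi ).*(?= \*,)/)

# Flattening the grouped result gives the same flat array.
flattened = with_group.flatten

p with_group
p no_group
p lookaround
p flattened
```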

Chaining array into new split function call

I have the following and am trying to split on '.' and then split the returned first part on '-' and return the last of the first part. I want to return 447.
a="cat-vm-447.json".split('.').split('-')
Also, how would I do this as a regular expression? I have this:
a="cat-vm-447.json".split(/-[\d]+./)
but this is splitting on the value. I want to return the number.
I can do this:
a="cat-vm-447.json".slice(/[\d]+/)
and this gives me back 447 but would really like to specify that the - and . surround it. Adding those in regex return them.
First question. split returns an array, so you need to use Array#[] to get the first (0) or last (-1) element of this array. Alternatives are the Array#first and Array#last methods.
a="cat-vm-447.json".split('.')[0].split('-')[-1] # => "447"
Second question. You can match your number into a group and then get it from the result (it will have index 1; the item with index 0 will be the full match, "-447." in your case). You can use the String#[] or String#match methods (among others) to match your regex.
"cat-vm-447.json"[/-(\d+)\./, 1] # => "447"
# or
"cat-vm-447.json".match(/-(\d+)\./)[1] # => "447"
Split returns an array, so you need to specify the index for the next split.
a="cat-vm-447.json".split('.').first.split('-').last
For the regular expression, you need to wrap what you want to capture in parentheses.
/-(\d+)\./
a = "cat-vm-447.json"
b = a.match(/-(\d+)\./)
p b[1] # => "447"
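As a quick check of which index holds what, this sketch prints the pieces of the MatchData for the filename from the question. The names whole, number and via_split are illustrative only.

```ruby
a = "cat-vm-447.json"
b = a.match(/-(\d+)\./)

# Index 0 is the whole match, including the "-" and "." delimiters.
whole = b[0]

# Index 1 is the first capture group, i.e. just the digits.
number = b[1]

# The split-based version from the earlier answer, for comparison.
via_split = a.split('.').first.split('-').last

p whole
p number
p via_split
```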
Try something like this:
if "cat-vm-447.json" =~ /([\d]+)/
p $1
else
p "No matches"
end
The parentheses in the regex extract the result in the $1 variable.
When you split your string the second time, you are actually trying to split an Array instead of a String.
ruby-1.9.3-head :003 > "cat-vm-447.json".split('.')
# => ["cat-vm-447", "json"]
In the regexp case, you can split on /[-.]/
ruby-1.9.3-head :008 > "cat-vm-447.json".split(/[-.]/)
# => ["cat", "vm", "447", "json"]
ruby-1.9.3-head :009 > "cat-vm-447.json".split(/[-.]/)[2]
# => "447"

Regular expression to match a pattern either at the beginning of the line or after a space character

I've been trying to dry up the following regexp that matches hashtags in a string with no success:
/^#(\w+)|\s#(\w+)/i
This won't work:
/^|\s#(\w+)/i
And no, I don't want to capture the alternation at the beginning:
/(^|\s)#(\w+)/i
I'm doing this in Ruby - though that should not be relevant I suppose.
To give some examples of matching and non-matching strings:
'#hashtag it is' # should match => [["hashtag"]]
'this is a #hashtag' # should match => [["hashtag"]]
'this is not a#hashtag' # should not match => []
Any suggestions? Am I nitpicking?
You can use:
/\B#(\w+)/i
"this is a #hash tag" # matches
"#hash tag" # matches
"this is not#hash tag" # doesn't match
/(?:^|\s)#(\w+)/i
Adding the ?: prefix to the first group makes it a non-capturing group, so only the second group actually captures. Each match will therefore have a single capturing group, whose contents will be the hashtag.
This uses look-behind and I don't know if look behinds are supported in Ruby (I heard that they are not supported in JavaScript)
/(^#(\w+))|((?<= )#(\w+))/
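A sketch comparing the suggestions on the three sample strings from the question. Ruby's regex engine has supported lookbehind since 1.9, so that route is available too. The negative-lookbehind pattern /(?<!\S)#(\w+)/ below is a variant assumed for illustration, not one of the patterns posted in the thread, and the names samples, patterns and results are illustrative only.

```ruby
samples = [
  '#hashtag it is',        # should match
  'this is a #hashtag',    # should match
  'this is not a#hashtag'  # should not match
]

patterns = {
  non_word_boundary: /\B#(\w+)/i,      # from the first answer
  non_capturing:     /(?:^|\s)#(\w+)/i, # from the second answer
  lookbehind:        /(?<!\S)#(\w+)/i   # assumed variant: "#" not preceded by a non-space
}

# Run every pattern over every sample and collect the scan results.
results = patterns.transform_values do |regex|
  samples.map { |text| text.scan(regex) }
end

results.each { |name, hits| puts "#{name}: #{hits.inspect}" }
```

On these three samples all the patterns agree; they can differ on other input. For example, \B also matches after punctuation such as "(", where the whitespace-based patterns do not.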

Resources