Why does this regex work in JavaScript but not in Ruby? - ruby

text = 'http://www.site.info www.escola.ninja.br google.com.ag'
expression: (http:\/\/)?((www\.)?\w+\.\w{2,}(\.\w{2,})?)
In Javascript, this expression works, returning:
["http://www.site.info", "www.escola.ninja.br", "google.com.ag"]
Why isn't it working in Ruby?
For example:
Using the match method:
p text.match(/(http:\/\/)?(www\.)?\w+\.\w{2,}(\.\w{2})?/)
#<MatchData "http://www.site.info" 1:"http://" 2:"www." 3:nil>
Using the scan method:
p text.scan(/(http:\/\/)?(www\.)?\w+\.\w{2,}(\.\w{2})?/)
[["http://", "www.", nil], [nil, "www.", ".br"], [nil, nil, ".ag"]]
How can I return the following array instead?
["http://www.site.info", "www.escola.ninja.br", "google.com.ag"]

This happens because, according to the Ruby String#scan documentation:
If the pattern contains groups, each individual result is itself an array containing one entry per group.
So you can simply make the groups non-capturing by converting (...) to (?:...), which gives the following expression:
text.scan(/(?:http:\/\/)?(?:(?:www\.)?\w+\.\w{2,}(?:\.\w{2,})?)/)
# => ["http://www.site.info", "www.escola.ninja.br", "google.com.ag"]

The reason is that str.match(/regex/g) in JS does not keep captured substrings, see MDN String#match() reference:
If the regular expression includes the g flag, the method returns an Array containing all matched substrings rather than match objects. Captured groups are not returned.
In Ruby, you have to modify the pattern to remove redundant capturing groups and turn capturing ones into non-capturing (that is, replace unescaped ( with (?:) because otherwise, only the captured substrings will get output by the String#scan method:
If the pattern contains no groups, each individual result consists of the matched string, $&. If the pattern contains groups, each individual result is itself an array containing one entry per group.
Use
text = 'http://www.site.info www.escola.ninja.br google.com.ag'
puts text.scan(/(?:http:\/\/)?(?:www\.)?\w+\.\w{2,}(?:\.\w{2,})?/)
Output of the demo:
http://www.site.info
www.escola.ninja.br
google.com.ag
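If the capturing groups have to stay in the pattern for some reason, here is a minimal sketch of one way to still get the whole matches. The helper name whole_matches is made up for illustration; the pattern is the one from the question.

```ruby
# Sketch: whole matches from a pattern that still contains capturing groups.
# whole_matches is a hypothetical helper name, not part of the question.
def whole_matches(text, pattern)
  # String#gsub called with only a pattern returns an Enumerator over the
  # matched text, so the groups inside the pattern do not affect the result.
  text.gsub(pattern).to_a
end

text = 'http://www.site.info www.escola.ninja.br google.com.ag'
with_groups = /(http:\/\/)?((www\.)?\w+\.\w{2,}(\.\w{2,})?)/

p whole_matches(text, with_groups)
# => ["http://www.site.info", "www.escola.ninja.br", "google.com.ag"]
```

This leaves the regex untouched, which can be handy when the same pattern is also used with match and the groups are needed there.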

Related

Substring extraction issues with regex in Ruby

I'm attempting to do some substring extraction in Ruby through the use of regular expressions, and running into some issues with the regexp being "overly selective".
Here's the target string I'm attempting to match:
"Example string with 3 numbers, 2 commas, and 6,388 other values that are not included."
What I'm attempting to extract is the numerical values in the statement provided. In order to account for the comma, I came up with the expression /(\d{1,3}(,\d{1,3})*)/.
Testing the following in IRB, this is the code and result:
string = "Example string with 3 numbers, 2 commas, and 6,388 other values that are not included."
puts string.scan(/(\d{1,3}(,\d{1,3})*)/)
=> "[[\"3\", nil], [\"2\", nil], [\"6,388\", \",388\"]]"
What I'm looking for is something along the lines of ["3", "2", "6,388"]. Here are the issues I need help correcting:
Why does Ruby include nil for each match group that is not comma-delimited, and how do I adjust the regular expression/match strategy to remove that and get a "flat" array?
How do I prevent the regular expression from matching a sub-expression of the substring I'm attempting to match (that is, ",388" in "6,388")?
I did attempt to use .match(), but ran into the issue that it simply returned "3" (presumably, the first value matched) with no other information apparent. Attempting to index that with [1] or [2] resulted in nil.
If there's a capturing group in the pattern, String#scan returns an array of arrays, with one entry per group.
For each match, a result is generated and either added to the result
array or passed to the block. If the pattern contains no groups, each
individual result consists of the matched string, $&. If the pattern
contains groups, each individual result is itself an array containing
one entry per group.
By removing the capturing groups, or by replacing (...) with a non-capturing group (?:...), you will get a flat array instead:
string = "Example string with 3 numbers, 2 commas, and 6,388 other values ..."
string.scan(/\d{1,3}(?:,\d{1,3})*/) # no capturing group
# => ["3", "2", "6,388"]
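As a follow-up sketch, the flat array can then be turned into integers, and the difference to match becomes visible: match stops at the first hit, while scan collects all of them. The helper name extract_numbers is hypothetical.

```ruby
# Sketch: extract the numbers and convert them to integers.
# extract_numbers is a hypothetical helper name.
def extract_numbers(text)
  matches = text.scan(/\d{1,3}(?:,\d{1,3})*/)   # flat array of strings
  matches.map { |digits| digits.delete(',').to_i } # drop the thousands separators
end

string = "Example string with 3 numbers, 2 commas, and 6,388 other values that are not included."

p string.match(/\d{1,3}(?:,\d{1,3})*/)[0] # => "3" (match only reports the first hit)
p extract_numbers(string)                 # => [3, 2, 6388]
```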

Capturing groups don't work as expected with Ruby scan method

I need to get an array of floats (both positive and negative) from the multiline string. E.g.: -45.124, 1124.325 etc
Here's what I do:
text.scan(/(\+|\-)?\d+(\.\d+)?/)
Although it works fine on regex101 (capturing group 0 matches everything I need), it doesn't work in Ruby code.
Any ideas why it's happening and how I can improve that?
See scan documentation:
If the pattern contains no groups, each individual result consists of the matched string, $&. If the pattern contains groups, each individual result is itself an array containing one entry per group.
You should remove capturing groups (if they are redundant), or make them non-capturing (if you just need to group a sequence of patterns to be able to quantify them), or use extra code/group in case a capturing group cannot be avoided.
In this scenario, the capturing group is only used to quantify a pattern sequence, so all you need to do is convert it into a non-capturing one by replacing the unescaped ( with (?: (there is only one such group that has to stay, the sign alternation can become a character class):
text = " -45.124, 1124.325"
puts text.scan(/[+-]?\d+(?:\.\d+)?/)
See demo, output:
-45.124
1124.325
Well, if you need to also match floats like .04 you can use [+-]?\d*\.?\d+. See another demo
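A small sketch of that second pattern, with the matches converted to Float. The sample input and the helper name parse_floats are assumptions for illustration only.

```ruby
# Sketch: scan for signed decimals, including ones without a leading digit,
# then convert each match to a Float. parse_floats is a hypothetical name.
def parse_floats(text)
  text.scan(/[+-]?\d*\.?\d+/).map(&:to_f)
end

p parse_floats("-45.124, 1124.325, .04, +7")
# => [-45.124, 1124.325, 0.04, 7.0]
```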
There are cases when you cannot get rid of a capturing group, e.g. when the regex contains a backreference to it. In that case, you may a) declare a variable to store the matches and collect them inside a scan block, b) enclose the whole pattern in another capturing group and map the results to get the first item from each match, or c) call gsub with just a regex as its single argument, which returns an Enumerator, and use .to_a to get the array of matches:
text = "11234566666678"
# Variant a:
results = []
text.scan(/(\d)\1+/) { results << Regexp.last_match(0) }
p results # => ["11", "666666"]
# Variant b:
p text.scan(/((\d)\2+)/).map(&:first) # => ["11", "666666"]
# Variant c:
p text.gsub(/(\d)\1+/).to_a # => ["11", "666666"]
See this Ruby demo.
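If both the whole match and the group values are needed for every hit, one more option (a sketch, not something from the answer above) is to turn scan into an Enumerator and read Regexp.last_match inside the block. The helper name runs_with_digit is hypothetical.

```ruby
# Sketch: keep the whole match and the captured digit for every hit.
# runs_with_digit is a hypothetical helper name.
def runs_with_digit(text)
  # to_enum(:scan, ...) wraps the block form of scan in an Enumerator; inside
  # the block, Regexp.last_match is the MatchData of the current hit.
  text.to_enum(:scan, /(\d)\1+/).map do
    m = Regexp.last_match
    [m[0], m[1]]
  end
end

p runs_with_digit("11234566666678") # => [["11", "1"], ["666666", "6"]]
```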
([+-]?\d+\.\d+)
assumes there is a leading digit before the decimal point
see demo at Rubular
If you need capture groups for a complex pattern match, but want the entire expression returned by .scan, this can work for you.
Suppose you want to get the image urls in this string perhaps from a markdown text with html image tags:
str = %(
Before
<img src="https://images.zenhubusercontent.com/11223344e051aa2c30577d9d17/110459e6-915b-47cd-9d2c-1842z4b73d71">
After
<img src="https://user-images.githubusercontent.com/111222333/75255445-f59fb800-57af-11ea-9b7a-a235b84bf150.png">).strip
You may have a regular expression defined to match just the urls, and maybe used a Rubular example like this to build/test your Regexp
image_regex =
/https\:\/\/(user-)?images.(githubusercontent|zenhubusercontent).com.*\b/
Now, if you don't need each sub-capture group but just the entire expression from your .scan, you can wrap the whole pattern inside a capture group and use it like this:
image_regex =
/(https\:\/\/(user-)?images.(githubusercontent|zenhubusercontent).com.*\b)/
str.scan(image_regex).map(&:first)
=> ["https://images.zenhubusercontent.com/11223344e051aa2c30577d9d17/110459e6-915b-47cd-9d2c-1842z4b73d71",
"https://user-images.githubusercontent.com/111222333/75255445-f59fb800-57af-11ea-9b7a-a235b84bf150.png"]
How does this actually work?
Since you have 3 capture groups, .scan alone will return an Array of arrays, with one entry for each capture:
str.scan(image_regex)
=> [["https://images.zenhubusercontent.com/11223344e051aa2c30577d9d17/110459e6-915b-47cd-9d2c-1842z4b73d71", nil, "zenhubusercontent"],
["https://user-images.githubusercontent.com/111222333/75255445-f59fb800-57af-11ea-9b7a-a235b84bf150.png", "user-", "githubusercontent"]]
Since we only want the 1st (outer) capture group, we can just call .map(&:first)
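For comparison, here is a sketch of the same extraction with non-capturing groups, which makes scan return a flat array directly. The helper name image_urls is hypothetical, and the dots in the host name are escaped here, which the original pattern did not do.

```ruby
# Sketch: non-capturing groups, so scan returns the whole URLs directly.
# image_urls is a hypothetical helper name.
def image_urls(html)
  html.scan(%r{https://(?:user-)?images\.(?:githubusercontent|zenhubusercontent)\.com.*\b})
end

str = %(
Before
<img src="https://images.zenhubusercontent.com/11223344e051aa2c30577d9d17/110459e6-915b-47cd-9d2c-1842z4b73d71">
After
<img src="https://user-images.githubusercontent.com/111222333/75255445-f59fb800-57af-11ea-9b7a-a235b84bf150.png">).strip

# Prints the two URLs, one per line, without the trailing "> of each tag.
puts image_urls(str)
```

Wrapping the pattern and calling .map(&:first) is still the better fit when the inner groups are needed elsewhere.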

Regex matches but .scan returns nil

As you can see on Rubular the regexp <p( style=".+"){0,1}>.+<\/p> matches the string <p>aasdad</p>.
But when I do "<p>sdasdasd</p>".scan(/<p( style=".+"){0,1}>.+<\/p>/) I get [[nil]]. Why is the matched string not included in the return value?
That's the way scan works. From the Ruby documentation for scan:
If the pattern contains groups, each individual result is itself an
array containing one entry per group.
Since the optional group ( style=".+") doesn't match you get only a nil in the result. You can use (?: for a non-capturing group:
"<p>sdasdasd</p>".scan(/<p(?: style=".+"){0,1}>.+<\/p>/)
# => ["<p>sdasdasd</p>"]
You could also try .match, which returns a MatchData object holding the whole match as well as the groups:
"<p>sdasdasd</p>".match(/<p( style=".+"){0,1}>.+<\/p>/)
# => #<MatchData "<p>sdasdasd</p>" 1:nil>
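A short sketch of how the whole match and the groups can be read from the MatchData while keeping the capturing group in place. The helper name whole_and_groups is hypothetical.

```ruby
# Sketch: read the whole match and the groups from a single MatchData.
# whole_and_groups is a hypothetical helper name.
def whole_and_groups(text, pattern)
  m = text.match(pattern)
  return nil unless m          # match returns nil when nothing matches
  [m[0], m.captures]           # m[0] is the whole match, captures are the groups
end

paragraph = /<p( style=".+"){0,1}>.+<\/p>/

p whole_and_groups("<p>sdasdasd</p>", paragraph) # => ["<p>sdasdasd</p>", [nil]]
p "<p>sdasdasd</p>"[paragraph]                   # => "<p>sdasdasd</p>"
```

String#[] with a regex is the shortest form when only the first whole match is needed.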

regex that returns matches of a range in length and containing one of several words in Ruby

I tried the following regex in Ruby:
"the foodogand the catlada are mouseing".scan(/\b(?=\w{6,12}\b)\w{0,9}(cat|dog|mouse)\w*/)
but instead of it returning
["foodogand", "catlada", "mouseing"]
I'm getting
[["dog"], ["cat"], ["mouse"]] # the results are also in arrays
What's wrong here?
The results are also in arrays. I could flatten this, but is there a way to avoid it?
Use ?: for the last group:
"the foodogand the catlada are mouseing".scan(/\b(?=\w{6,12}\b)\w{0,9}(?:cat|dog|mouse)\w*/)
#=> ["foodogand", "catlada", "mouseing"]
From the docs:
If the pattern contains groups, each individual result is itself an array containing one entry per group.
The ?: makes the group non-capturing, avoiding a nested array.
I would just clean that up a bit by replacing \w{0,9} with \w*, since the lookahead already takes care of the length. Keep the \b inside the lookahead, though: without it only the minimum length of 6 is enforced, not the maximum of 12.
"the foodogand the catlada are mouseing".scan /\b(?=\w{6,12}\b)\w*(?:cat|dog|mouse)\w*/
#=> ["foodogand", "catlada", "mouseing"]
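To see the length check at work, here is a sketch with words that fall outside the 6 to 12 character range. The sample sentence and the helper name keyword_words are made up for illustration.

```ruby
# Sketch: the lookahead (?=\w{6,12}\b) rejects words that are too short or
# too long before the keyword is even looked for.
# keyword_words is a hypothetical helper name.
def keyword_words(text)
  text.scan(/\b(?=\w{6,12}\b)\w*(?:cat|dog|mouse)\w*/)
end

# "cat" (3) and "dogs" (4) are too short, "concatenationalists" (19) is too long.
p keyword_words("cat dogs hotdogs concatenationalists mousetrap")
# => ["hotdogs", "mousetrap"]
```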

How to get the particular part of string matching regexp in Ruby?

I've got a string Unnecessary:12357927251data and I need to select all the data after the colon and the numbers. I want to do it using a Regexp.
string.scan(/:\d+.+$/)
This will give me :12357927251data, but can I select only needed information .+ (data)?
Anything in parentheses in a regexp will be captured as a group, which you can access in $1, $2, etc. or by using [] on a match object:
string.match(/:\d+(.+)$/)[1]
If you use scan with capturing groups, you will get an array of arrays of the groups:
"Unnecessary:123data\nUnnecessary:5791next".scan(/:\d+(.+)$/)
=> [["data"], ["next"]]
Use parentheses in your regular expression and the result will be broken out into an array. For example:
x='Unnecessary:12357927251data'
x.scan(/(:\d+)(.+)$/)
=> [[":12357927251", "data"]]
x.scan(/:\d+(.+$)/).flatten
=> ["data"]
Assuming that you are trying to get the string 'data' from your string, then you can use:
string.match(/.*:\d*(.*)/)[1]
String#match returns a MatchData object. You can then index into that MatchData object to find the part of the string that you want.
(The first element of MatchData is the entire matched text, which here is the whole string because the pattern starts with .*; the second element is the part of the string captured by the parentheses)
Try this: /(?<=\:)\d+.+$/
It changes the colon to a positive look-behind so that it does not appear in the output. Note that the colon is not a metacharacter in Ruby regular expressions, so the backslash is optional. The match still includes the digits (12357927251data), not just data.
Using IRB
irb(main):004:0> "Unnecessary:12357927251data".scan(/:\d+(.+)$/)
=> [["data"]]
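Two more ways to get only the trailing part without a nested array, sketched under the assumption that the input looks like the sample in the question. The helper name data_after_digits is hypothetical.

```ruby
# Sketch: return only the text after the colon and digits, as a flat array.
# data_after_digits is a hypothetical helper name.
def data_after_digits(text)
  # \K drops everything matched so far from the reported match, so scan
  # returns plain strings and no capturing group is needed.
  text.scan(/:\d+\K.+$/)
end

string = 'Unnecessary:12357927251data'

# String#[] accepts a regex plus a group index and returns that group.
p string[/:\d+(.+)$/, 1] # => "data"

p data_after_digits("Unnecessary:123data\nUnnecessary:5791next")
# => ["data", "next"]
```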

Resources