Regex that returns matches within a length range and containing one of several words in Ruby

I tried the following regex in Ruby:
"the foodogand the catlada are mouseing".scan(/\b(?=\w{6,12}\b)\w{0,9}(cat|dog|mouse)\w*/)
but instead of it returning
["foodogand", "catlada", "mouseing"]
I'm getting
[["dog"], ["cat"], ["mouse"]]
What's wrong here? The results are also nested in arrays. I could flatten them, but is there a way to avoid that?

Use ?: for the last group:
"the foodogand the catlada are mouseing".scan(/\b(?=\w{6,12}\b)\w{0,9}(?:cat|dog|mouse)\w*/)
#=> ["foodogand", "catlada", "mouseing"]
From the docs:
If the pattern contains groups, each individual result is itself an array containing one entry per group.
The ?: makes the group non-capturing, avoiding a nested array.

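I would just clean that up a bit by replacing \w{0,9} with \w* (the lookahead takes care of the length). Keep the \b inside the lookahead, though: without it, the lookahead only requires at least 6 word characters and the upper bound of 12 is no longer enforced.
"the foodogand the catlada are mouseing".scan(/\b(?=\w{6,12}\b)\w*(?:cat|dog|mouse)\w*/)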
#=> ["foodogand", "catlada", "mouseing"]
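To see what the \b inside the lookahead contributes, here is a minimal sketch. The 19-character word and the names text, bounded and unbounded are made up for illustration; they are not from the question.

```ruby
# Hypothetical input: "abcdefghijklmnopcat" is 19 word characters long,
# so it should be rejected by a 6..12 length limit.
text = "the foodogand abcdefghijklmnopcat"

bounded   = /\b(?=\w{6,12}\b)\w*(?:cat|dog|mouse)\w*/  # \b inside the lookahead
unbounded = /\b(?=\w{6,12})\w*(?:cat|dog|mouse)\w*\b/  # \b moved to the end

p text.scan(bounded)    # => ["foodogand"]
p text.scan(unbounded)  # => ["foodogand", "abcdefghijklmnopcat"]
```

Without the inner \b, the lookahead is satisfied as soon as 6 word characters follow, so longer words slip through.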

Related

Why does this regex work in JavaScript but not in Ruby?

text = 'http://www.site.info www.escola.ninja.br google.com.ag'
expression: (http:\/\/)?((www\.)?\w+\.\w{2,}(\.\w{2,})?)
In Javascript, this expression works, returning:
["http://www.site.info", "www.escola.ninja.br", "google.com.ag"]
Why isn't it working in Ruby?
For example, using the match method:
p text.match(/(http:\/\/)?(www\.)?\w+\.\w{2,}(\.\w{2})?/)
#<MatchData "http://www.site.info" 1:"http://" 2:"www." 3:nil>
Using the scan method:
p text.scan(/(http:\/\/)?(www\.)?\w+\.\w{2,}(\.\w{2})?/)
[["http://", "www.", nil], [nil, "www.", ".br"], [nil, nil, ".ag"]]
How can I return the following array instead?
["http://www.site.info", "www.escola.ninja.br", "google.com.ag"]
Because, according to the documentation of Ruby's String#scan method:
If the pattern contains groups, each individual result is itself an array containing one entry per group.
So you can simply modify the expression so that the groups are non-capturing by converting (...) to (?:...), resulting in the following expression
text.scan(/(?:http:\/\/)?(?:(?:www\.)?\w+\.\w{2,}(?:\.\w{2,})?)/)
# => ["http://www.site.info", "www.escola.ninja.br", "google.com.ag"]
The reason is that str.match(/regex/g) in JS does not keep captured substrings, see MDN String#match() reference:
If the regular expression includes the g flag, the method returns an Array containing all matched substrings rather than match objects. Captured groups are not returned.
In Ruby, you have to modify the pattern to remove redundant capturing groups and turn capturing ones into non-capturing (that is, replace unescaped ( with (?:) because otherwise, only the captured substrings will get output by the String#scan method:
If the pattern contains no groups, each individual result consists of the matched string, $&. If the pattern contains groups, each individual result is itself an array containing one entry per group.
Use
text = 'http://www.site.info www.escola.ninja.br google.com.ag'
puts text.scan(/(?:http:\/\/)?(?:www\.)?\w+\.\w{2,}(?:\.\w{2,})?/)
Output of the demo:
http://www.site.info
www.escola.ninja.br
google.com.ag

Substring extraction issues with regex in Ruby

I'm attempting to do some substring extraction in Ruby through the use of regular expressions, and running into some issues with the regexp being "overly selective".
Here's the target string I'm attempting to match:
"Example string with 3 numbers, 2 commas, and 6,388 other values that are not included."
What I'm attempting to extract is the numerical values in the statement provided. In order to account for the comma, I came up with the expression /(\d{1,3}(,\d{1,3})*)/.
Testing the following in IRB, this is the code and result:
string = "Example string with 3 numbers, 2 commas, and 6,388 other values that are not included."
puts string.scan(/(\d{1,3}(,\d{1,3})*)/)
=> "[[\"3\", nil], [\"2\", nil], [\"6,388\", \",388\"]]"
What I'm looking for is something along the lines of ["3", "2", "6,388"]. Here are the issues I need help correcting:
Why does Ruby include nil for each match group that is not comma-delimited, and how do I adjust the regular expression/match strategy to remove that and get a "flat" array?
How do I prevent the regular expression from matching a sub-expression of the substring I'm attempting to match (that is, ",388" in "6,388")?
I did attempt to use .match(), but ran into the issue that it simply returned "3" (presumably, the first value matched) with no other information apparent. Attempting to index that with [1] or [2] resulted in nil.
If there's a capturing group in the pattern, String#scan returns an array of arrays to represent all the groups.
For each match, a result is generated and either added to the result
array or passed to the block. If the pattern contains no groups, each
individual result consists of the matched string, $&. If the pattern
contains groups, each individual result is itself an array containing
one entry per group.
By removing the capturing groups, or by replacing (...) with a non-capturing group (?:...), you will get a different result:
string = "Example string with 3 numbers, 2 commas, and 6,388 other values ..."
string.scan(/\d{1,3}(?:,\d{1,3})*/) # no capturing group
# => ["3", "2", "6,388"]
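On the .match question: String#match stops at the first match, which is why only "3" came back, while scan keeps going. A small sketch of both, plus an optional conversion to integers; the names number, first, all and values are illustrative assumptions, not part of the original answer.

```ruby
string = "Example string with 3 numbers, 2 commas, and 6,388 other values that are not included."
number = /\d{1,3}(?:,\d{1,3})*/

# match returns a MatchData for the first match only.
first = string.match(number)
p first[0]  # => "3"

# scan collects every match; with no capturing groups the result is flat.
all = string.scan(number)
p all       # => ["3", "2", "6,388"]

# Optional: strip the thousands separators and convert to integers.
values = all.map { |s| s.delete(",").to_i }
p values    # => [3, 2, 6388]
```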

Ruby regex count matched elements in the array of digits

I have a string:
'my_array1: ["1445","374","1449","378"], my_array2: ["1445","374", "1449","378"]'
I need to match all sets of digits from my_array2: [...] and count how many of them there are.
I need to do something like this with regex and ruby MatchData
string = 'my_array1: ["1445","374", "1449","378"], my_array2: ["1445","374", "1449","378"]'
matches = string.match(/my_array2\:\s[\[,]\"(\d+)\"/)
count_matches = matches.size
Expected result should be 4.
What is the correct way of doing it?
If you are guaranteed that the content of my_array2 is always numeric, you could simply use split twice: first split by my_array2: [" and then split by ,. This should give you the number of items you are after.
If you are not guaranteed that, you could still split by my_array2 and, instead of splitting again, use a pattern such as "\d+" (or "\d+(\.\d+)?" if you have floating point values) and count the matches.
An example of the expression is available here.
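A rough sketch of the second suggestion (split on my_array2, then count the digit runs). It assumes my_array2 is the last array in the string, as in the question; tail and numbers are illustrative names.

```ruby
string = 'my_array1: ["1445","374", "1449","378"], my_array2: ["1445","374", "1449","378"]'

# Everything after the "my_array2" label; assumes it is the last array in the string.
tail = string.split("my_array2").last

# Each run of digits is one element.
numbers = tail.scan(/\d+/)
p numbers       # => ["1445", "374", "1449", "378"]
p numbers.size  # => 4
```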

Capturing groups don't work as expected with Ruby scan method

I need to get an array of floats (both positive and negative) from the multiline string. E.g.: -45.124, 1124.325 etc
Here's what I do:
text.scan(/(\+|\-)?\d+(\.\d+)?/)
Although it works fine on regex101 (capturing group 0 matches everything I need), it doesn't work in Ruby code.
Any ideas why it's happening and how I can improve that?
See scan documentation:
If the pattern contains no groups, each individual result consists of the matched string, $&. If the pattern contains groups, each individual result is itself an array containing one entry per group.
You should remove capturing groups (if they are redundant), or make them non-capturing (if you just need to group a sequence of patterns to be able to quantify them), or use extra code/group in case a capturing group cannot be avoided.
In this scenario, the capturing group is used to quantify a pattern sequence, thus all you need to do is convert the capturing group into a non-capturing one by replacing all unescaped ( with (?: (there is only one occurrence here):
text = " -45.124, 1124.325"
puts text.scan(/[+-]?\d+(?:\.\d+)?/)
See demo, output:
-45.124
1124.325
If you also need to match floats like .04, you can use [+-]?\d*\.?\d+. See another demo.
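A quick side-by-side of the two patterns, as a sketch on a made-up multiline input (the sample text and the names strict and lenient are assumptions for illustration):

```ruby
text = "-45.124, 1124.325\n+3.5 and .04"

strict  = /[+-]?\d+(?:\.\d+)?/  # requires a digit before the decimal point
lenient = /[+-]?\d*\.?\d+/      # also accepts a bare leading dot

p text.scan(strict)   # => ["-45.124", "1124.325", "+3.5", "04"]
p text.scan(lenient)  # => ["-45.124", "1124.325", "+3.5", ".04"]
```

Note how the strict pattern silently drops the leading dot of .04 rather than skipping the value.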
There are cases when you cannot get rid of a capturing group, e.g. when the regex contains a backreference to a capturing group. In that case, you may a) declare a variable to store all matches and collect them inside a scan block, b) enclose the whole pattern in another capturing group and map the results to get the first item from each match, or c) call gsub with just a regex as its single argument, which returns an Enumerator, and use .to_a to get the array of matches:
text = "11234566666678"
# Variant a:
results = []
text.scan(/(\d)\1+/) { results << Regexp.last_match(0) }
p results # => ["11", "666666"]
# Variant b:
p text.scan(/((\d)\2+)/).map(&:first) # => ["11", "666666"]
# Variant c:
p text.gsub(/(\d)\1+/).to_a # => ["11", "666666"]
See this Ruby demo.
([+-]?\d+\.\d+)
This assumes there is a leading digit before the decimal point. See the demo at Rubular.
If you need capture groups for a complex pattern match, but want the entire expression returned by .scan, this can work for you.
Suppose you want to get the image urls in this string perhaps from a markdown text with html image tags:
str = %(
Before
<img src="https://images.zenhubusercontent.com/11223344e051aa2c30577d9d17/110459e6-915b-47cd-9d2c-1842z4b73d71">
After
<img src="https://user-images.githubusercontent.com/111222333/75255445-f59fb800-57af-11ea-9b7a-a235b84bf150.png">).strip
You may have a regular expression defined to match just the urls, and maybe used a Rubular example like this to build/test your Regexp
image_regex =
/https\:\/\/(user-)?images.(githubusercontent|zenhubusercontent).com.*\b/
Now, if you don't need each sub-capture group but just the entire expression in your .scan, you can wrap the whole pattern inside a capture group and use it like this:
image_regex =
/(https\:\/\/(user-)?images.(githubusercontent|zenhubusercontent).com.*\b)/
str.scan(image_regex).map(&:first)
=> ["https://images.zenhubusercontent.com/11223344e051aa2c30577d9d17/110459e6-915b-47cd-9d2c-1842z4b73d71",
"https://user-images.githubusercontent.com/111222333/75255445-f59fb800-57af-11ea-9b7a-a235b84bf150.png"]
How does this actually work?
Since you have 3 capture groups, .scan alone will return an Array of arrays, with one entry per capture group in each inner array:
str.scan(image_regex)
=> [["https://images.zenhubusercontent.com/11223344e051aa2c30577d9d17/110459e6-915b-47cd-9d2c-1842z4b73d71", nil, "zenhubusercontent"],
["https://user-images.githubusercontent.com/111222333/75255445-f59fb800-57af-11ea-9b7a-a235b84bf150.png", "user-", "githubusercontent"]]
Since we only want the 1st (outer) capture group, we can just call .map(&:first)
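If the inner groups are only there for grouping and are never read, another option is to make them non-capturing so that .scan returns plain strings. This is a sketch of that variation, not the approach described above; escaping the dots is an extra tightening I have assumed is wanted, and urls is an illustrative name.

```ruby
str = %(
Before
<img src="https://images.zenhubusercontent.com/11223344e051aa2c30577d9d17/110459e6-915b-47cd-9d2c-1842z4b73d71">
After
<img src="https://user-images.githubusercontent.com/111222333/75255445-f59fb800-57af-11ea-9b7a-a235b84bf150.png">).strip

# Same pattern with (?:...) groups; the dots are escaped so they only match a literal ".".
image_regex = /https:\/\/(?:user-)?images\.(?:githubusercontent|zenhubusercontent)\.com.*\b/

# No capturing groups, so each result is the whole matched URL.
urls = str.scan(image_regex)
puts urls
```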

At which position does the regex fail?

I need a very simple string validator that shows the position of the first character that does not conform to the desired format. I want to use a regex, but in this case I have to find the place where the string stops matching the expression, and I can't find a method that would do that.
(It's got to be a fairly simple method... maybe there isn't one?)
For example if I have regex:
/^Q+E+R+$/
with string:
"QQQQEEE2ER"
The desired result should be 7
An idea: what you can do is to tokenize your pattern and write it with optional nested capturing groups:
^(Q+(E+(R+($)?)?)?)?
Then you only need to count the number of capture groups you obtain to know where the regex engine stops in the pattern and you can determine the offset of the match end in the string with the whole match length.
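A minimal sketch of this idea in Ruby, assuming the nested pattern above; PARTIAL and match_progress are hypothetical names introduced for illustration.

```ruby
PARTIAL = /^(Q+(E+(R+($)?)?)?)?/

# Returns [offset, groups]: the length of the matched prefix and
# the number of nested groups that were set.
def match_progress(str)
  m = PARTIAL.match(str)  # always matches, since the whole pattern is optional
  [m[0].length, m.captures.compact.size]
end

offset, groups = match_progress("QQQQEEE2ER")
p offset  # => 7
p groups  # => 2 (Q+ and E+ matched, R+ did not)
```

The offset is the index of the first character the pattern could not consume, and the group count tells you which element of the pattern failed.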
As @zx81 notes in his comment, things become different if one of the elements can also match the next element (for example, if Q can match the element E).
Let's say that Q is \w (and can match E and R). For the string QQQEEERRR, the preceding pattern will give only one capturing group (the greedy \w+ matches everything), whereas ^(\w+)(E+)(R+)$ will give three groups: QQQEE, E, RRR.
To obtain the same result you need to add an alternation:
^((?:\w+(?=E)|\w+)(E+(R+($)?)?)?)?
In the alternation, the case where E exists must be tested first, and only if this branch fails (with the lookahead), then the other branch where E doesn't exist is used.
Thus the full pattern can be rewritten like this to deal with this specific case:
^((?:Q+(?=E)|Q+)((?:E+(?=R)|E+)((?:R+(?=$)|R+)($)?)?)?)?
You could perhaps also take a look at the amatch gem.
This is an interesting task that can be accomplished with a neat regex trick:
^(?:(?=(Q+)))?(?:(?=(Q+E+)))?(?:(?=(Q+E+R+)))?(?:(?=(Q+E+R+$)))?
We have four optional lookaheads checking various parts of the pattern and capturing the partial matches to Groups 1, 2, 3 and 4 incrementally.
Group 1 contains Q+ if it can be matched, in your example QQQQ.
Group 2 contains Q+E+ if it can be matched, in your example QQQQEEE.
Group 3 contains Q+E+R+ if it can be matched, in your example nil.
Group 4 contains Q+E+R+$ if it can be matched, in your example nil.
In your code, check which is the last Group that is set by testing !$1.nil?, !$2.nil? and so on.
The last one set gives you the length that is matchable, so in your example $2.length gives you the 7 you wanted.
Incidentally, the fact that Group 2 is the last one set also tells you that we fail on R+.
For your example, you could do the following.
Code
Change your regex from:
/^Q+E+R+$/
to
R = /^(Q+)(E+)?(R+)?/
and then apply the following method to the string:
def nbr_matched_chars(str)
  str.scan(R).flatten.reduce(0) { |t, e| return t if e.nil?; t + e.size }
end
If nbr_matched_chars(str) is less than str.size, the returned value is the index of the first offending character. If it equals str.size, nothing in the string is out of place, although the string may still be incomplete ("QQEE", for example, lacks the required R+).
Examples
nbr_matched_chars("QQQQEEE2ER") #=> 7
nbr_matched_chars("QQQQEEEERR") #=> 10 (= "QQQQEEEERR".size)
nbr_matched_chars("QQAQQEEEER") #=> 2
Explanation
To see why this [evidently :-)] works, we can look at the results of invoking String#scan, followed by Array#flatten:
"QQQQEEE2ER".scan(R).flatten #=> ["QQQQ", "EEE" , nil ]
"QQQQEEEERR".scan(R).flatten #=> ["QQQQ", "EEEE", "RR"]
"QQAQQEEEER".scan(R).flatten #=> ["QQ" , nil , nil ]

Resources