Ruby - best way to extract regex capture groups?

I was reading a regex group matching question and I see that there are two ways to reference capture groups from a regular expression, namely:
1) The String#match method, e.g. string.match(/(^.*)(:)(.*)/i).captures
2) Perl-esque capture group variables such as $1, $2, etc., obtained from if match =~ /(^.*)(:)(.*)/i
Update: As mentioned by 0xCAFEBABE, there is a third option too: the Regexp.last_match method.
Which is better? With 1), for safety, you would have to use an if statement to guard against nil, so why not extract the information right there instead of taking a second step to call captures? Option 2) therefore looks more convenient to me.

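Since Ruby 2.4, MatchData has had named_captures, which can be used like this. Just add the ?<some_name> syntax inside a capture group. Note that once a pattern contains a named group, plain parentheses no longer capture, which is why captures returns only ["a"] in the third line below.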
/(\w)(\w)/.match("ab").captures # => ["a", "b"]
/(\w)(\w)/.match("ab").named_captures # => {}
/(?<some_name>\w)(\w)/.match("ab").captures # => ["a"]
/(?<some_name>\w)(\w)/.match("ab").named_captures # => {"some_name"=>"a"}
Even more relevant, you can reference a named capture by name!
result = /(?<some_name>\w)(\w)/.match("ab")
result["some_name"] # => "a"
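As a rough sketch of how named captures combine with the nil guard the question asks about: the input string, the pattern, and the group names `name` and `rest` below are all made up for illustration.

```ruby
# Hypothetical input and group names, chosen only for illustration.
line = "RyanOnRails: This is a test"
pattern = /\A(?<name>[^:]+):\s*(?<rest>.*)\z/

# String#match returns nil when nothing matches, so the `if` doubles as the guard.
if (m = line.match(pattern))
  name = m[:name]    # named groups accept a Symbol ...
  rest = m["rest"]   # ... or a String key
end

# A string without a colon does not match, and the result is simply nil.
no_match = "no colon here".match(pattern)
```

The guard and the extraction happen in the same step here, which addresses the "why a second step?" concern from the question without relying on $1 and friends.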

For simple tasks, directly accessing the pseudo variables $1, etc. may be shorter and easier, but when things get complicated, going through MatchData instances is (nearly) the only way to go.
For example, suppose you are doing nested gsub:
string1.gsub(regex1) do |string2|
  string2.gsub(regex2) do
    ... # Impossible/difficult to refer to match data of outer loop
  end
end
Within the inner loop, suppose you wanted to refer to a captured group of the outer gsub. Calling $1, $2, etc. would not give the right result, because the inner gsub has replaced the last match data. This is a source of bugs.
It is necessary to refer to captured groups via match data:
string1.gsub(regex1) do |string2|
  m1 = $~
  string2.gsub(regex2) do
    m2 = $~
    ... # match data of the outer loop can be accessed via `m1`.
    # match data of the inner loop can be accessed via `m2`.
  end
end
In short, if you want to do short hackish things for simple tasks, you can use the pseudo variables. If you want to keep your code more structured and expandable, you should access data through match data.
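A concrete, runnable version of the nested gsub pattern might look like the sketch below. The input string and the transformation (prefixing each digit with its key) are invented purely to make the outer and inner match data visible.

```ruby
# Hypothetical data: upcase each key and prefix every digit of its value with that key.
input = "a=12 b=3"

output = input.gsub(/(\w)=(\d+)/) do
  outer = $~                      # snapshot before the inner gsub overwrites $~
  digits = outer[2].gsub(/\d/) do
    inner = $~
    "#{outer[1]}#{inner[0]}"      # outer group 1 followed by the inner match
  end
  "#{outer[1].upcase}=#{digits}"  # $1 would be stale here; `outer` is not
end

puts output # prints "A=a1a2 B=b3"
```

After the inner gsub returns, $1 no longer refers to the outer key, so the final interpolation only works because `outer` was saved first.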

Related

Use middle of a string variable in Chef

I have a code like this in Chef
{
  'home/user1/folder/file.erb'=>'/home/user1/folder/file',
  'home/user2/folder/file.erb'=>'/home/user2/folder/file',
  'home/user3/folder/file.erb'=>'/home/user3/folder/file',
  'home/user4/folder/file.erb'=>'/home/user4/folder/file',
}.each do |s,d|
  template d do
    source s
    owner user
    group user
    mode '600'
  end
end
How do I set the value of owner and group to user1, user2, user3... taken from the variable d?
Thanks!

Split Your Hash Values on /
There are certainly other ways to do this, but given your example an easy trick is simply to grab the user's directory from each hash value into a block-local variable at the top of each loop, which you can then reuse as needed. For example:
{
  'home/user1/folder/file.erb' => '/home/user1/folder/file',
  'home/user2/folder/file.erb' => '/home/user2/folder/file',
  'home/user3/folder/file.erb' => '/home/user3/folder/file',
  'home/user4/folder/file.erb' => '/home/user4/folder/file',
}.each do |src, dst|
  # capture username for use as owner & group
  usr = dst.split(?/)[2]
  template dst do
    source src
    owner usr
    group usr
    mode '600'
  end
end
Using String#split works by breaking the string into an Array of elements using / as a separator. Indexing into the array with [2] gives you the third element, which is the username, which you are apparently also using for the group.
The fact that it's the third element rather than the second isn't intuitive. However, when you use #split on your sample data, you get results like this:
'/home/user4/folder/file'.split ?/
#=> ["", "home", "user4", "folder", "file"]
Because of the way #split works, the leading / in your inputs yields an empty string as the first element of each destination array. Since Ruby arrays are zero-indexed, the element you want is the third one (i.e. [2]) in each of your sample values.
There are certainly other ways to do this, but this is a simple way to do what you want without making significant changes to your code. It often helps to remember that Chef (and Puppet!) are really just DSLs built on top of Ruby, so you can often use standard Ruby methods to get the job done.
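Since the surrounding page is about capture groups, here is a small plain-Ruby sketch (no Chef required) comparing the split approach with a capture-group alternative. The regex and the variable names are assumptions for illustration; the sample paths come from the question.

```ruby
# Two of the destination paths from the question.
destinations = ['/home/user1/folder/file', '/home/user2/folder/file']

# Approach from the answer: split on "/" and take the third element.
by_split = destinations.map { |dst| dst.split('/')[2] }

# Alternative sketch: String#[] with a regex and a capture group index.
# This returns nil instead of a wrong element if a path is not under /home.
by_regex = destinations.map { |dst| dst[%r{\A/home/([^/]+)/}, 1] }

p by_split # prints ["user1", "user2"]
p by_regex # prints ["user1", "user2"]
```

Either expression could stand in for the `usr = ...` line inside the each block; the regex form is a little stricter about the expected path layout.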

Extracting a substring from a string using `Regexp.new`

I have a string like this:
var = "Renewal Quote RQ00041233 (Payment Pending) Policy R38A014294-1"
I have to extract "Payment Pending" from that string using only the information included in another single string.
The following:
var[/\((.*)\)/, 1]
will extract what I want. I can include the string representation of the regex in the string to be given, and construct the regular expression from it using Regexp.new, but I have no way to also convey the 1 that is passed as the second argument of [].
Without the second argument 1,
regex_string = '\((.*)\)'
var[Regexp.new(regex_string)]
fetches the string "(Payment Pending)" instead of the expected "Payment Pending".
Can someone help me?

Not sure what you are trying to do, but you can get rid of capturing groups using a different regex:
var[/(?<=\().*(?=\))/]
# => "Payment Pending"
or
var[Regexp.new('(?<=\().*(?=\))')]
# => "Payment Pending"

/\((.*)\)/ is just shorthand for Regexp.new('\((.*)\)').
String#[] takes a regex and a capture group as two separate arguments. var[/\((.*)\)/, 1] is var[regexp, 1].
The important thing to realize is 1 is passed to var[], not the regex.
re = Regexp.new('\((.*)\)')
match = var[re, 1]
Note: you might want to require a named capture group rather than a numbered one. It's very easy to accidentally include an extra capture group in a regex.
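A minimal sketch of the named-group idea: if the group name is fixed by convention (`status` here is a hypothetical name, not something from the question), the single string only needs to carry the pattern itself.

```ruby
var = "Renewal Quote RQ00041233 (Payment Pending) Policy R38A014294-1"

# The pattern arrives as a plain string; "status" is an assumed, agreed-upon group name.
regex_string = '\((?<status>.*)\)'
re = Regexp.new(regex_string)

# String#[] accepts a group name in place of a group number.
by_name = var[re, "status"]

# The same capture through MatchData, guarding against a failed match.
m = re.match(var)
by_match = m && m[:status]

puts by_name # prints "Payment Pending"
```

Because the lookup is by name, adding another group to the pattern later does not shift which capture is returned.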

Assuming there are no nested parentheses in the string, one way to do that without using a regular expression is as follows.
instance_eval "var[(i=var.index('(')+1)..var.index(')',i)-1]"
#=> "Payment Pending"
See String#index, particularly the reference to the optional second argument, "offset".

Why regex works in javascript, but don't work in ruby?

text = 'http://www.site.info www.escola.ninja.br google.com.ag'
expression: (http:\/\/)?((www\.)?\w+\.\w{2,}(\.\w{2,})?)
In Javascript, this expression works, returning:
["http://www.site.info", "www.escola.ninja.br", "google.com.ag"]
Why is it not working in Ruby?
For example:
Using the match method:
p text.match(/(http:\/\/)?(www\.)?\w+\.\w{2,}(\.\w{2})?/)
#<MatchData "http://www.site.info" 1:"http://" 2:"www." 3:nil>
Using the scan method:
p text.scan(/(http:\/\/)?(www\.)?\w+\.\w{2,}(\.\w{2})?/)
[["http://", "www.", nil], [nil, "www.", ".br"], [nil, nil, ".ag"]]
How can I return the following array instead?
["http://www.site.info", "www.escola.ninja.br", "google.com.ag"]

Because according to the Ruby String#scan method:
If the pattern contains groups, each individual result is itself an array containing one entry per group.
So you can simply modify the expression so that the groups are non-capturing by converting (...) to (?:...), resulting in the following expression
text.scan(/(?:http:\/\/)?(?:(?:www\.)?\w+\.\w{2,}(?:\.\w{2,})?)/)
# => ["http://www.site.info", "www.escola.ninja.br", "google.com.ag"]

The reason is that str.match(/regex/g) in JS does not keep captured substrings, see MDN String#match() reference:
If the regular expression includes the g flag, the method returns an Array containing all matched substrings rather than match objects. Captured groups are not returned.
In Ruby, you have to modify the pattern to remove redundant capturing groups and turn capturing ones into non-capturing (that is, replace unescaped ( with (?:) because otherwise, only the captured substrings will get output by the String#scan method:
If the pattern contains no groups, each individual result consists of the matched string, $&. If the pattern contains groups, each individual result is itself an array containing one entry per group.
Use
text = 'http://www.site.info www.escola.ninja.br google.com.ag'
puts text.scan(/(?:http:\/\/)?(?:www\.)?\w+\.\w{2,}(?:\.\w{2,})?/)
Output of the demo:
http://www.site.info
www.escola.ninja.br
google.com.ag
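To see both behaviours of String#scan side by side, the sketch below runs the capturing and the non-capturing form of the pattern against the question's text. The variable names are illustrative only.

```ruby
text = 'http://www.site.info www.escola.ninja.br google.com.ag'

# With capturing groups, scan returns one array of group values per match.
with_groups = text.scan(/(http:\/\/)?(www\.)?\w+\.\w{2,}(\.\w{2,})?/)

# With non-capturing groups, scan returns the whole matched strings.
without_groups = text.scan(/(?:http:\/\/)?(?:www\.)?\w+\.\w{2,}(?:\.\w{2,})?/)

p with_groups
# prints [["http://", "www.", nil], [nil, "www.", ".br"], [nil, nil, ".ag"]]
p without_groups
# prints ["http://www.site.info", "www.escola.ninja.br", "google.com.ag"]
```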

Capturing groups don't work as expected with Ruby scan method

I need to get an array of floats (both positive and negative) from the multiline string. E.g.: -45.124, 1124.325 etc
Here's what I do:
text.scan(/(\+|\-)?\d+(\.\d+)?/)
Although it works fine on regex101 (capturing group 0 matches everything I need), it doesn't work in Ruby code.
Any ideas why it's happening and how I can improve that?

See scan documentation:
If the pattern contains no groups, each individual result consists of the matched string, $&. If the pattern contains groups, each individual result is itself an array containing one entry per group.
You should remove capturing groups (if they are redundant), or make them non-capturing (if you just need to group a sequence of patterns to be able to quantify them), or use extra code/group in case a capturing group cannot be avoided.
In this scenario, the capturing group is used to quantify a pattern sequence, thus all you need to do is convert the capturing group into a non-capturing one by replacing all unescaped ( with (?: (there is only one occurrence here):
text = " -45.124, 1124.325"
puts text.scan(/[+-]?\d+(?:\.\d+)?/)
See demo, output:
-45.124
1124.325
Well, if you need to also match floats like .04 you can use [+-]?\d*\.?\d+. See another demo
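Since the question asks for an array of floats rather than strings, a possible follow-up step is to convert the matches with to_f. The sample text below is an assumption, extended with +7 and .04 to exercise the looser pattern.

```ruby
# Hypothetical multiline input covering signed, unsigned and leading-dot numbers.
text = "-45.124, 1124.325\n+7 and .04"

# No capturing groups, so scan yields the matched strings directly.
matches = text.scan(/[+-]?\d*\.?\d+/)

# Convert each matched string to a Float.
floats = matches.map(&:to_f)

p matches # prints ["-45.124", "1124.325", "+7", ".04"]
```

Note that this looser pattern also matches plain integers such as +7, which may or may not be what you want.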
There are cases when you cannot get rid of a capturing group, e.g. when the regex contains a backreference to a capturing group. In that case, you may either a) declare a variable to store all matches and collect them inside a scan block, b) enclose the whole pattern with another capturing group and map the results to get the first item from each match, or c) use gsub with just a regex as the single argument, which returns an Enumerator, and call .to_a on it to get the array of matches:
text = "11234566666678"
# Variant a:
results = []
text.scan(/(\d)\1+/) { results << Regexp.last_match(0) }
p results # => ["11", "666666"]
# Variant b:
p text.scan(/((\d)\2+)/).map(&:first) # => ["11", "666666"]
# Variant c:
p text.gsub(/(\d)\1+/).to_a # => ["11", "666666"]
See this Ruby demo.

([+-]?\d+\.\d+)
assumes there is a leading digit before the decimal point
see demo at Rubular

If you need capture groups for a complex pattern match, but want the entire expression returned by .scan, this can work for you.
Suppose you want to get the image urls in this string perhaps from a markdown text with html image tags:
str = %(
Before
<img src="https://images.zenhubusercontent.com/11223344e051aa2c30577d9d17/110459e6-915b-47cd-9d2c-1842z4b73d71">
After
<img src="https://user-images.githubusercontent.com/111222333/75255445-f59fb800-57af-11ea-9b7a-a235b84bf150.png">).strip
You may have a regular expression defined to match just the urls, and maybe used a Rubular example like this to build/test your Regexp
image_regex =
/https\:\/\/(user-)?images.(githubusercontent|zenhubusercontent).com.*\b/
Now, if you don't need each sub-capture group but just the entire expression from your .scan, you can wrap the whole pattern inside a capture group and use it like this:
image_regex =
/(https\:\/\/(user-)?images.(githubusercontent|zenhubusercontent).com.*\b)/
str.scan(image_regex).map(&:first)
=> ["https://images.zenhubusercontent.com/11223344e051aa2c30577d9d17/110459e6-915b-47cd-9d2c-1842z4b73d71",
"https://user-images.githubusercontent.com/111222333/75255445-f59fb800-57af-11ea-9b7a-a235b84bf150.png"]
How does this actually work?
Since you have 3 capture groups, .scan alone will return an Array of arrays, with one inner array per match and one entry per capture group:
str.scan(image_regex)
=> [["https://images.zenhubusercontent.com/11223344e051aa2c30577d9d17/110459e6-915b-47cd-9d2c-1842z4b73d71", nil, "zenhubusercontent"],
["https://user-images.githubusercontent.com/111222333/75255445-f59fb800-57af-11ea-9b7a-a235b84bf150.png", "user-", "githubusercontent"]]
Since we only want the 1st (outer) capture group, we can just call .map(&:first)
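If the inner groups are only there for grouping, another option is to make them non-capturing so that scan returns the full matches without the extra map. The sketch below uses shortened, made-up URLs; escaping the dots and stopping at a quote or whitespace with `[^"\s]*` are my own choices, not part of the original pattern.

```ruby
# Hypothetical, shortened stand-ins for the image URLs in the answer.
str = <<~HTML
  Before
  <img src="https://images.zenhubusercontent.com/abc/def">
  After
  <img src="https://user-images.githubusercontent.com/123/456.png">
HTML

# (?:...) groups without capturing, so scan returns whole matched strings.
image_regex = %r{https://(?:user-)?images\.(?:githubusercontent|zenhubusercontent)\.com[^"\s]*}

urls = str.scan(image_regex)
p urls
# prints ["https://images.zenhubusercontent.com/abc/def", "https://user-images.githubusercontent.com/123/456.png"]
```

Wrapping the whole pattern in an outer group, as the answer does, remains the better fit when the inner captures are needed elsewhere.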

Ruby Regexp group matching, assign variables on 1 line

I'm currently trying to use a regexp to split a string into multiple variables. Example string:
ryan_string = "RyanOnRails: This is a test"
I've matched it with this regexp, with 3 groups:
ryan_group = ryan_string.scan(/(^.*)(:)(.*)/i)
Now to access each group I have to do something like this:
ryan_group[0][0] (first group) RyanOnRails
ryan_group[0][1] (second group) :
ryan_group[0][2] (third group) This is a test
This seems pretty ridiculous and it feels like I'm doing something wrong. I would expect to be able to do something like this:
g1, g2, g3 = ryan_string.scan(/(^.*)(:)(.*)/i)
Is this possible? Or is there a better way than how I'm doing it?

You don't want scan for this, as it makes little sense. You can use String#match which will return a MatchData object, you can then call #captures to return an Array of captures. Something like this:
#!/usr/bin/env ruby
string = "RyanOnRails: This is a test"
one, two, three = string.match(/(^.*)(:)(.*)/i).captures
p one #=> "RyanOnRails"
p two #=> ":"
p three #=> " This is a test"
Be aware that if no match is found, String#match will return nil, so something like this might work better:
if match = string.match(/(^.*)(:)(.*)/i)
one, two, three = match.captures
end
Although scan makes little sense for this, it does still do the job; you just need to flatten the returned Array first:
one, two, three = string.scan(/(^.*)(:)(.*)/i).flatten
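If you want to keep the assignment on one line and still survive a failed match, one possible sketch uses the safe navigation operator (Ruby 2.3 and later). The second input string is made up to show the non-matching case.

```ruby
string = "RyanOnRails: This is a test"
pattern = /(^.*)(:)(.*)/i

# &. skips the call to captures when match returns nil.
one, two, three = string.match(pattern)&.captures

# On a failed match the right-hand side is nil, so all three variables become nil.
m1, m2, m3 = "no colon".match(pattern)&.captures

p [one, two, three] # prints ["RyanOnRails", ":", " This is a test"]
p [m1, m2, m3]      # prints [nil, nil, nil]
```

The trade-off is that a failed match is silent, so the caller still has to check for nil before using the values.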

You could use String#match or =~ instead, which would give you a single match, and you could either access the match data the same way or just use the special match variables $1, $2, $3.
Something like:
if ryan_string =~ /(^.*)(:)(.*)/i
first = $1
third = $3
end

You can name your captured matches
string = "RyanOnRails: This is a test"
/(?<one>^.*)(?<two>:)(?<three>.*)/i =~ string
puts one, two, three
The local variables are only assigned when the regex literal is on the left-hand side of =~; it doesn't work if you reverse the order of the string and the regex.
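The difference between the two operand orders can be checked with a short sketch. The group names `framework`, `rest`, and `other` are hypothetical, and `binding.local_variable_defined?` is used only to make the missing variable observable without raising a NameError.

```ruby
string = "RyanOnRails: This is a test"

# Regex literal on the left: Ruby creates the locals `framework` and `rest`.
/(?<framework>[^:]+):\s*(?<rest>.*)/ =~ string

# String on the left: the match succeeds, but no local named `other` is created.
string =~ /(?<other>[^:]+):/
other_defined = binding.local_variable_defined?(:other)

# The capture is still reachable through the match data.
from_match_data = $~[:other]

puts framework     # prints "RyanOnRails"
puts other_defined # prints "false"
```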

You have to decide whether it is a good idea, but a Ruby regexp can (automagically) define local variables for you!
I am not yet sure whether this feature is awesome or just totally crazy.
ryan_string = "RyanOnRails: This is a test"
/^(?<webframework>.*)(?<colon>:)(?<rest>.*)/ =~ ryan_string
# This defined three variables for you. Crazy, but true.
webframework # => "RyanOnRails"
puts "W: #{webframework} , C: #{colon}, R: #{rest}"
(Take a look at http://ruby-doc.org/core-2.1.1/Regexp.html , search for "local variable").
Note:
As pointed out in a comment, there is a similar and earlier answer to this question by @toonsend (https://stackoverflow.com/a/21412455). I do not think I was "stealing", but if you want to be fair with praise and honor the first answer, feel free :) I hope no animals were harmed.

scan() will find all non-overlapping matches of the regex in your string, so instead of returning an array of your groups like you seem to be expecting, it is returning an array of arrays.
You are probably better off using match(), and then getting the array of captures using MatchData#captures:
g1, g2, g3 = ryan_string.match(/(^.*)(:)(.*)/i).captures
However you could also do this with scan() if you wanted to:
g1, g2, g3 = ryan_string.scan(/(^.*)(:)(.*)/i)[0]
