Regular expression - get multiple matches into each group - ruby

I have a string like this:
raw_string = "(a=1)(b=2)(c=3)"
I'd like to match this and get values within each set of parentheses, and get each result in a group.
For example:
group 0 = "a=1"
group 1 = "b=2" and so on..
I've tried /(\(.*\))/g but it doesn't seem to work. Can someone help me with this?
thanks!

str = "(a=1)(b=2) (c=3)"
As suggested in a comment by #stribizhev:
r = /
\( # Match a left paren
([^\)]+) # Match >= 1 characters other than a right paren in capture group 1
\) # Match a right paren
/x # extended/free-spacing regex definition mode
str.scan(r).flatten
#=> ["a=1", "b=2", "c=3"]
Note ([^\)]+) could replaced by (.+?), making it a lazy match on any characters, as I've done in this alternative regex, which uses lookarounds rather than a capture group:
r = /
(?<=\() # Match a left paren in a positive lookbehind
.+? # Match >= 1 characters lazily
(?=\)) # Match a right paren in a positive lookahead
/x
Here the lookbehind could be replaced by \(\K, which reads, "match a left paren then forget about everything matched so far".
Lastly, you could use String#split on the right then left paren, possibly separated by spaces, then delete the first left and last right parens:
str.split(/\)\s*\(/).map { |s| s.delete '()' }
#=> ["a=1", "b=2", "c=3"]
Wouldn't it be nice if we could write s.strip(/[()]/)?

If you mean the pattern with parentheses appears exactly three times (or a different fixed number of times), then it is possible, but if you intend that the pattern appears an arbitrary number of times, then you can't. A regex can only have a fixed number of captures or named captures.

Just to show that you can get them into an arbitrary number of capture groups:
"(a=1)(b=2)(c=3)"[/#{'(?:\((.*?)\))?' * 99}/]
[$1, $2, $3]
#=> ["a=1", "b=2", "c=3"]

Related

Positive Lookahead and Non-capturing group difference

When you want to match either of two patterns but not capture it, you would use a noncapturing group ?::
/(?:https?|ftp)://(.+)/
But what if I want to capture '_1' in the string 'john_1'. It could be '2' or '' followed by anything else. First I tried a non-capturing group:
'john_1'.gsub(/(?:.+)(_.+)/, "")
=> ""
It does not work. I am telling it to not capture one or more characters but to capture _ and all characters after it.
Instead the following works:
'john_1'.gsub(/(?=.+)(_.+)/, "")
=> "john"
I used a positive lookahead. The definition I found for positive lookahead was as follows:
q(?=u) matches a q that is
followed by a u, without making the u part of the match. The positive
lookahead construct is a pair of parentheses, with the opening
parenthesis followed by a question mark and an equals sign.
But that definition doesn't really fit my example. What makes the Positive Lookahead work but not the Non-capturing group work in the example I provide?
Capturing and matching are two different things. (?:expr) doesn't capture expr, but it's still included in the matched string. Zero-width assertions, e.g. (?=expr), don't capture or include expr in the matched string.
Perhaps some examples will help illustrate the difference:
> "abcdef"[/abc(def)/] # => abcdef
> $1 # => def
> "abcdef"[/abc(?:def)/] # => abcdef
> $1 # => nil
> "abcdef"[/abc(?=def)/] # => abc
> $1 # => nil
When you use a non-capturing group in your String#gsub call, it's still part of the match, and gets replaced by the replacement string.
Your first example doesn't work because a non-capturing group is still part of the overall capture, whereas the lookbehind is only used for matching but isn't part of the overall capture.
This is easier to understand if you get the actual match data:
# Non-capturing group
/(?:.+)(_.+)/.match 'john_1'
=> #<MatchData "john_1" 1:"_1">
# Positive Lookbehind
/(?=.+)(_.+)/.match 'john_1'
=> #<MatchData "_1" 1:"_1">
EDIT: I should also mention that sub and gsub work on the entire capture, not individual capture groups (although those can be used in the replacement).
'john_1'.gsub(/(?:.+)(_.+)/, 'phil\1')
=> "phil_1"
Let's consider a couple of situations.
The string preceding the underscore must be "john" and the underscore is followed by one or more characters
str = "john_1"
You have two choices.
Use a positive lookbehind
str[/(?<=john)_.+/]
#=> "_1"
The positive lookbehind requires that "john" must appear immediately before the underscore, but it is not part of the match that is returned.
Use a capture group:
str[/john(_.+)/, 1]
#=> "_1"
This regular expression matches "john_1", but "_.+" is captured in capture group 1. By examining the doc for the method String#[] you will see that one form of the method is str[regexp, capture], which returns the contents of the capture group capture. Here capture equals 1, meaning the first capture group.
Note that the string following the underscore may contain underscores: "john_1_a"[/(?<=john)_.+/] #=> "_1_a".
If the underscore can be at the end of the string replace + with * in the above regular expressions (meaning match zero or more characters after the underscore).
The string preceding the underscore can be anything and and the underscore is followed by one or more characters
str = "john_mary_tom_julie"
We may consider two cases.
The string returned is to begin with the first underscore
In this case we could write:
str[/_.+/]
#=> "_mary_tom_julie"
This works because the regex is by default greedy, meaning it will begin at the first underscore encountered.
The string returned is to begin with the last underscore
Here we could write:
str[/_[^_]+\z/]
#=> "_julie"
This regex matches an underscore followed by one or more characters that are not underscores, followed by the end-of-string anchor (\z).
Aside: the method String#[]
[] may seem an odd name for a method but it is a method nevertheless, so it can be invoked in the conventional way:
str.[](/john(_.+)/, 1)
#=> "_1"
The expression str[/john(_.+)/, 1] is an example (of which there are many in Ruby) of syntactic sugar. When written str[...] Ruby converts it to the conventional expression for methods before evaluating it.

Regex to grab full firstname and first letter of last name

I have a list of users grabbed by the Etc Ruby library:
Thomas_J_Perkins
Jennifer_Scanner
Amanda_K_Loso
Aaron_Cole
Mark_L_Lamb
What I need to do is grab the full first name, skip the middle name (if given), and grab the first character of the last name. The output should look like this:
Thomas P
Jennifer S
Amanda L
Aaron C
Mark L
I'm not sure how to do this, I've tried grabbing all of the characters: /\w+/ but that will grab everything.
You don't always need regular expressions.
Some people, when confronted with a problem, think "I know, I'll use
regular expressions." Now they have two problems. Jamie Zawinski
You can do it with some simple Ruby code
string = "Mark_L_Lamb"
string.split('_').first + ' ' + string.split('_').last[0]
=> "Mark L"
I think its simpler without regex:
array = "Thomas_J_Perkins".split("_") # split at _
array.first + " " + array.last[0] # .first prints first name .last[0] prints first char of last name
#=> "Thomas P"
You can use
^([^\W_]+)(?:_[^\W_]+)*_([^\W_])[^\W_]*$
And replace with \1_\2. See the regex demo
The [^\W_] matches a letter or a digit. If you want to only match letters, replace [^\W_] with \p{L}.
^(\p{L}+)(?:_\p{L}+)*_(\p{L})\p{L}*$
See updated demo
The point is to match and capture the first chunk of letters up to the first _ (with (\p{L}+)), then match 0+ sequences of _ + letters inside (with (?:_\p{L}+)*_) and then match and capture the last word first letter (with (\p{L})) and then match the rest of the string (with \p{L}*).
NOTE: replace ^ with \A and $ with \z if you have independent strings (as in Ruby ^ matches the start of a line and $ matches the end of the line).
Ruby code:
s.sub(/^(\p{L}+)(?:_\p{L}+)*_(\p{L})\p{L}*$/, "\\1_\\2")
I'm in the don't-use-a-regex-for-this camp.
str1 = "Alexander_Graham_Bell"
str2 = "Sylvester_Grisby"
"#{str1[0...str1.index('_')]} #{str1[str1.rindex('_')+1]}"
#=> "Alexander B"
"#{str2[0...str2.index('_')]} #{str2[str2.rindex('_')+1]}"
#=> "Sylvester G"
or
first, last = str1.split(/_.+_|_/)
#=> ["Alexander", "Bell"]
first+' '+last[0]
#=> "Alexander B"
first, last = str2.split(/_.+_|_/)
#=> ["Sylvester", "Grisby"]
first+' '+last[0]
#=> "Sylvester G"
but if you insist...
r = /
(.+?) # match any characters non-greedily in capture group 1
(?=_) # match an underscore in a positive lookahead
(?:.*) # match any characters greedily in a non-capture group
(?:_) # match an underscore in a non-capture group
(.) # match any character in capture group 2
/x # free-spacing regex definition mode
str1 =~ r
$1+' '+$2
#=> "Alexander B"
str2 =~ r
$1+' '+$2
#=> "Sylvester G"
You can of course write
r = /(.+?)(?=_)(?:.*)(?:_)(.)/
This is my attempt:
/([a-zA-Z]+)_([a-zA-Z]+_)?([a-zA-Z])/
See demo
Let's see if this works:
/^([^_]+)(?:_\w)?_(\w)/
And then you'll have to combine the first and second matches into the format you want. I don't know Ruby, so I can't help you there.
And another attempt using a replacement method:
result = subject.gsub(/^([^_]+)(?:_[^_])?_([^_])[^_]+$/, '\1 \2')
We capture the entire string, with the relevant parts in capturing groups. Then just return the two captured groups
using the split method is much better
full_names.map do |full_name|
parts = full_name.split('_').values_at(0,-1)
parts.last.slice!(1..-1)
parts.join(' ')
end
/^[A-Za-z]{5,15}\s[A-Za-z]{1}]$/i
This will have the following criteria:
5-15 characters for first name then a whitespace and finally a single character for last name.

I need a regex to capture just the number from a string

I do not have access to the code, this is via an interface that only allows me to edit the regex that parses user responses. I need to extract the weight after users text, where they text things like:
wt 172.5
172.5 lbs
180
wt. 173.22
172,5
I need to capture the weight as a float field, but I want to restrict it to at most 1 decimal place. I tried using /(?<val>[\d+((\.|,)\d\d?)?]/ but it is only saving the first digit "1" in the field
Sometimes what seems most simple is not. I suggest using this regex:
r = /(?<=\A|\s)\d+(?:[.,]\d)?(?=\d|\s|\z)/
We can alternatively define the regex using extended or free-spacing mode (by adding the modifier x after the final /), which allows us to include documentation:
r = /
(?<=\A|\s) # match beginning of string or space in a positive lookbehind
\d+ # match one or more digits
(?:[.,]\d)? # optionally (? after non-capture group) match a . or , then a digit
(?=\d|\s|\z) # match a digit, space or the end of the string in a positive lookahead
/x
"wt 172.5"[r] #=> "172.5"
"172.5 lbs"[r] #=> "172.5"
"180"[r] #=> "180"
"wt. 173.22"[r] #=> "173.2"
"172,5"[r] #=> "172,5"
"A1 143.66"[r] #=> "143.6"
"A1 1.3.4 43.6"[r] #=> "43.6"
\d+(?:[,.]\d{1,2})?
Guess you wanted this .[] is character class,not what you think.Your character class captures just one out of all characters you have defined.
See demo.
https://regex101.com/r/eB8xU8/12

Ruby regular expression non capture group

I'm trying to grab id number from the string, say
id/number/2000GXZ2/ref=sr
using
(?:id\/number\/)([a-zA-Z0-9]{8})
for some reason non capture group is not worked, giving me:
id/number/2000GXZ2
As mentioned by others, non-capturing groups still count towards the overall match. If you don't want that part in your match use a lookbehind.
Rubular example
(?<=id\/number\/)([a-zA-Z0-9]{8})
(?<=pat) - Positive lookbehind assertion: ensures that the preceding characters match pat, but doesn't include those characters in the matched text
Ruby Doc Regexp
Also, the capture group around the id number is unnecessary in this case.
You have:
str = "id/number/2000GXZ2/ref=sr"
r = /
(?:id\/number\/) # match string in a non-capture group
([a-zA-Z0-9]{8}) # match character in character class 8 times, in capture group 1
/x # extended/free-spacing regex definition mode
Then (using String#[]):
str[r]
#=> "id/number/2000GXZ2"
returns the entire match, as it should, not just the contents of capture group 1. There are a few ways to remedy this. Consider first ones that do not use a capture group.
#jacob.m suggested putting the first part in a positive lookbehind (modified slightly from his code):
r = /
(?<=id\/number\/) # match string in positive lookbehind
[[:alnum:]]{8} # match >= 1 alphameric characters
/x
str[r]
#=> "2000GXZ2"
An alternative is:
r = /
id\/number\/ # match string
\K # forget everything matched so far
[[:alnum:]]{8} # match 8 alphanumeric characters
/x
str[r]
#=> "2000GXZ2"
\K is especially useful when the match to forget is variable-length, as (in Ruby) positive lookbehinds do not work with variable-length matches.
With both of these approaches, if the part to be matched contains only numbers and capital letters, you may want to use [A-Z0-9]+ instead of [[:alnum:]] (though the latter includes Unicode letters, not just those from the English alphabet). In fact, if all the entries have the form of your example, you might be able to use:
r = /
\d # match a digit
[A-Z0-9]{7} # match >= 0 capital letters or digits
/x
str[r]
#=> "2000GXZ2"
The other line of approach is to keep your capture group. One simple way is:
r = /
id\/number\/ # match string
([[:alnum:]]{8}) # match >= 1 alphameric characters in capture group 1
/x
str =~ r
str[r, 1] #=> "2000GXZ2"
Alternatively, you could use String#sub to replace the entire string with the contents of the capture group:
r = /
id\/number\/ # match string
([[:alnum:]]{8}) # match >= 1 alphameric characters in capture group 1
.* # match the remainder of the string
/x
str.sub(r, '\1') #=> "2000GXZ2"
str.sub(r, "\\1") #=> "2000GXZ2"
str.sub(r) { $1 } #=> "2000GXZ2"
This is Ruby Regexp expected match consistency evilness. Some Regexp-style methods will return the global-match while others will return specified matches.
In this case, one method we can use to get the behavior you're looking for is scan.
I don't think anyone here actually mentions how to get your Regexp working as you originally intended, which was to get the capture-only match. To do that, you would use the scan method like so with your original pattern:
test_me.rb
test_string="id/number/2000GXZ2/ref=sr"
result = test_string.scan(/(?:id\/number\/)([a-zA-Z0-9]{8})/)
puts result
2000GXZ2
That said, replacing (?:) with (?<=) for non-capture groups for look-behinds will benefit you both when you use scan as well as other parts of ruby that use Regexps.

Ruby search a string for matching character pairs

I want to match character pairs in a string. Let's say the string is:
"zttabcgqztwdegqf". Both "zt" and "gq" are matching pairs of characters in the string.
The following code finds the "zt" matching pair, but not the "gq" pair:
#!/usr/bin/env ruby
string = "zttabcgqztwdegqf"
puts string.scan(/.{1,2}/).detect{ |c| string.count(c) > 1 }
The code provides matching pairs where the indices of the pairs are 0&1,2&3,4&5... but not 1&2,3&4,5&6, etc:
zt
ta
bc
gq
zt
wd
eg
qf
I'm not sure regex in Ruby is the best way to go. But I want to use Ruby for the solution.
You can do your search with a single regex:
puts string.scan(/(?=(.{2}).*\1)/)
regex101 demo
Output
zt
gq
Regex Breakout
(?= # Start a lookahead
(.{2}) # Search any couple of char and group it in \1
.*\1 # Search ahead in the string for another \1 to validate
) # Close lookahead
Note
Putting all the checks inside lookahead assure the regex engine does not consume the couple when validates it.
So it also works with overlapping couples like in the string abcabc: the output will correctly be ab,bc.
Oddity
If the regex engine does not consume the chars how it can reach the end of the string?
Internally after the check Onigmo (the ruby regex engine) makes one step further automatically. Most regex flavours behaves in this way but e.g. the javascript engine needs the programmer to increment the last match index manually.
str = "ztcabcgqzttwtcdegqf"
r = /
(.) # match any character in capture group 1
(?= # begin a positive lookahead
(.) # match any character in capture group 2
.+ # match >= 1 characters
\1 # match capture group 1
\2 # match capture group 2
) # close positive lookahead
/x # extended/free-spacing regex definition mode
str.scan(r).map(&:join)
#=> ["zt", "tc", "gq"]
Here is one way to do this without using regex:
string = "zttabcgqztwdegqf"
p string.split('').each_cons(2).map(&:join).select {|i| string.scan(i).size > 1 }.uniq
#=> ["zt", "gq"]

Resources