Regular expression to get value in between parentheses - ruby

I am trying to write a regular expression to get the value in between parentheses. I expect a value without parentheses. For example, given:
value = "John sinu.s(14)"
I expected to get 14.
I tried the following:
value[/\(.*?\)/]
but it gives the result (14). Please help me.

You may do that using
value[/\((.*?)\)/, 1]
or
value[/\(([^()]*)\)/, 1]
Use a capturing group and a second argument to extract just the group value.
Note that \((.*?)\) will also match a substring that contains ( char in it, and the second option will only match a substring between parentheses that does not contain ( nor ) since [^()] is a negated character class that matches any char but ( and ).
See the Ruby demo online.
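To make the difference between the two patterns concrete, here is a small sketch. The input string, with a stray opening parenthesis, is made up for illustration and is not from the question.

```ruby
# Hypothetical input with an unbalanced "(" before the real pair.
value = "John (sinu.s(14)"

lazy    = value[/\((.*?)\)/, 1]      # starts at the first "(" it finds
negated = value[/\(([^()]*)\)/, 1]   # refuses to cross another parenthesis

p lazy     # => "sinu.s(14"
p negated  # => "14"
```

If the input never contains nested or stray parentheses, both patterns return the same value.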
From the Ruby docs:
str[regexp, capture] → new_str or nil
If a Regexp is supplied, the matching portion of the string is returned. If a capture follows the regular expression, which may be a capture group index or name, follows the regular expression that component of the MatchData is returned instead.
In case you need to extract multiple occurrences, use String#scan:
value = "John sinu.s(14) and Jack(156)"
p value.scan(/\(([^()]*)\)/).flatten
# => ["14", "156"]
See another Ruby demo.

Another option is to use lookarounds, which assert the surrounding context without including it in the match:
value[/(?<=\().*(?=\))/]
(?<=\() - positive lookbehind: asserts that ( comes right before the match, without including it
(?=\)) - positive lookahead: asserts that ) comes right after the match, without including it
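One thing to watch with the lookaround version is that .* is greedy. The sketch below uses a made-up input with two parenthesized values to show the effect, and a lazy .*? as one possible fix.

```ruby
value = "John(14) and Jack(156)"   # hypothetical input with two pairs

greedy = value[/(?<=\().*(?=\))/]    # .* runs up to the last ")"
lazy   = value[/(?<=\().*?(?=\))/]   # .*? stops at the first ")"

p greedy  # => "14) and Jack(156"
p lazy    # => "14"
```

With a single pair of parentheses, as in the question, both forms return "14".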

You can use
/(?<=\()[^)]+/
which selects the string inside the parentheses, without the parentheses themselves.
The only thing you need is the "positive lookbehind" feature.
Follow this link for more info about positive lookahead in special groups.
Ruby supports lookbehind as of 1.9, provided the pattern inside the lookbehind has a fixed width.

Try using this regular expression
/\((.*?)\)/
\( will match your opening parenthesis in the string
(.*?) creates a capturing group
\) will match your closing parenthesis

Do you wish to extract the string between the parentheses or do that using a regular expression? You specify the latter in the question but it's conceivable your question is really the former and you are assuming that a regular expression must be used.
If you just want the value, without any restriction on the method used to obtain it, you could do that quite simply using String#index and String#rindex.
s = "John sinu.s(14)"
s[s.index('(')+1 .. s.rindex(')')-1]
#=> "14"

Related

Ruby regex - gsub only captured group

I'm not quite sure I understand how non-capturing groups work. I am looking for a regex to produce this result: 5.214. I thought the regex below would work, but it is replacing everything including the non-capture groups. How can I write a regex to only replace the capture groups?
"5,214".gsub(/(?:\d)(,)(?:\d)/, '.')
# => ".14"
My desired result:
"5,214".gsub(some_regex)
#=> "5.214"
Non-capturing groups still consume the text they match, so that text is replaced as well.
Use
"5,214".gsub(/(\d+)(,)(\d+)/, '\1.\3')
or
"5,214".gsub(/(?<=\d),(?=\d)/, '.')
You can't. gsub replaces the entire match; it does not do anything with the captured groups. It will not make any difference whether the groups are captured or not.
In order to achieve the result, you need to use lookbehind and lookahead.
"5,214".gsub(/(?<=\d),(?=\d)/, '.')
It is also possible to use Regexp.last_match (also available via $~) in the block version to get access to the full MatchData:
"5,214".gsub(/(\d),(\d)/) { |_|
match = Regexp.last_match
"#{match[1]}.#{match[2]}"
}
This scales better to more involved use-cases.
Nota bene, from the Ruby docs:
the ::last_match is local to the thread and method scope of the method that did the pattern match.
gsub replaces the entire match the regular expression engine produces. Both capturing/non-capturing group constructs are not retained. However, you could use lookaround assertions which do not "consume" any characters on the string.
"5,214".gsub(/\d\K,(?=\d)/, '.')
Explanation: The \K escape sequence resets the starting point of the reported match and any previously consumed characters are no longer included. That being said, we then look for and match the comma, and the Positive Lookahead asserts that a digit follows.
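The practical difference between consuming the digits and merely asserting them shows up when matches sit next to each other. This is a minimal sketch with a made-up input, not taken from the question.

```ruby
input = "1,2,3"   # hypothetical input: the middle digit borders two commas

consuming  = input.gsub(/(\d),(\d)/, '\1.\2')    # digits are part of each match
lookaround = input.gsub(/(?<=\d),(?=\d)/, '.')   # only the comma is matched
keep_out   = input.gsub(/\d\K,(?=\d)/, '.')      # \K drops the leading digit from the match

p consuming   # => "1.2,3"
p lookaround  # => "1.2.3"
p keep_out    # => "1.2.3"
```

The capturing version misses the second comma because the digit 2 was already consumed by the first match. For "5,214" all three give "5.214".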
I know nothing about Ruby, but from what I see in the tutorial, gsub means replace.
The pattern should be /(?<=\d),(?=\d)/ to replace just the comma with a dot (Ruby rejects a variable-length lookbehind such as (?<=\d+)),
or use captures: /(\d+),(\d+)/ with the replacement string '\1.\2'.
You can easily reference capture groups in the replacement string (second argument) like so:
"5,214".gsub(/(\d+)(,)(\d+)/, '\1.\3')
#=> "5.214"
\0 will return the whole matched string.
\1 will be replaced by the first capturing group.
\2 will be replaced by the second capturing group etc.
You could rewrite the example above using a non-capturing group for the , char.
"5,214".gsub(/(\d+)(?:,)(\d+)/, '\1.\2')
#=> "5.214"
As you can see, the part after the comma is now the second capturing group, since we defined the middle group as non-capturing.
Although it's kind of pointless in this case. You can just omit the capturing group for , altogether
"5,214".gsub(/(\d+),(\d+)/, '\1.\2')
#=> "5.214"
You don't need regexp to achieve what you need:
'1,200.00'.tr('.','!').tr(',','.').tr('!', ',')
Periods become bangs (1,200!00)
Commas become periods (1.200!00)
Bangs become commas (1.200,00)
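The same swap can probably be done without the temporary "!" placeholder, since String#tr maps each character of the first set to the character at the same position in the second set. A minimal sketch:

```ruby
amount = '1,200.00'

# "." maps to "," and "," maps to "." in a single pass,
# so no intermediate placeholder character is needed.
swapped = amount.tr('.,', ',.')

p swapped  # => "1.200,00"
```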

How to use RegEx to replace items based on their context, without affecting the context

Using Ruby, I am writing a regular expression, and I need to be able to remove any colon that appears between parentheses. I understand that I can use
"This is a (string :)".sub!(/\([^\)]*:/, '')
to do this, but the problem is that this function will also remove the context along with it. Is there any way to specify that I only want it to remove the colon and not the entire matching expression?
Some regular expression engines support look-ahead and look-behind assertions, which match without consuming characters. Ruby supports look-ahead, but look-behind is more limited: Ruby 1.8 has none, and later versions only accept fixed-width look-behind patterns. That means you can quite easily stick with sub and remove a colon that precedes a closing parenthesis, but without ensuring that it comes after an opening parenthesis:
string = 'This is a (string :)'
string.sub(/:(?=\))/, '')
# => "This is a (string )"
The alternative would be to use subpattern capturing (which happens automatically when you use grouping in your regular expression) to rebuild the string without the undesirable portion, in this case the colon:
string.sub(/(\([^:]+):\)/, '\1)')
The \1 is a back-reference to what is matched in the first group, which is delimited by the parentheses that are not escaped. You can see here I didn't bother capturing the closing parenthesis in a second group, opting instead simply to include it in the substitution. This works well in this case because it will not change, but if you don't know that the colon will appear at the end of the parentheses-enclosed content, you would need a second group:
string.sub(/(\([^:]+):([^)]*\))/, '\1\2')
The prior answer will mostly work for deleting single colons within paren groups, but has trouble with multiples like '(thing:foo:bar)'. It would be nice to use lookbehind and lookahead to make the within-parens assertion, but Ruby (like most regexp engines) doesn't support variable-length patterns in lookbehind.
irb> s = 'x (a:b:c) : (1:2:3) y'
=> "x (a:b:c) : (1:2:3) y"
irb> s.gsub /(?<=\([^\(]*):(?=[^\)]*\))/, ''
SyntaxError: (irb):10: invalid pattern in look-behind: /(?<=\([^\(]*):(?=[^\)]*\))/
from /Users/dbenhur/.rbenv/versions/1.9.2-wp/bin/irb:12:in `<main>'
You could instead use the block form of gsub to capture paren enclosed groups, then remove colons from each match:
irb> s.gsub(/\([^\)]*\)/) {|m| m.delete ':'}
=> "x (abc) : (123) y"
In regex in general, you can match '(\()(:)(\))' and replace it with \1\3.
I'm not familiar with Ruby. Basically, you have three groups, (, : and ), and you drop the second one, the :. Note that this only matches a colon that sits directly between the two parentheses.
I tested it in Notepad++ and it works.
I think this is called a regex backreference.
Cheers.
If you can assume all parentheses will come in balanced pairs like they do in your example, this should be all you need:
"This is a (string :)".gsub!(/:(?=[^()]*\))/, '')
If the lookahead succeeds in finding a closing paren without seeing an opening paren first, the colon must be inside a (...) sequence. Notice how I excluded the opening paren as well as the closing paren; that's essential.
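As a quick check of the lookahead approach against the multi-colon string used in the earlier answer, here is a small sketch. It assumes balanced, non-nested parentheses, as stated above.

```ruby
s = 'x (a:b:c) : (1:2:3) y'   # sample string from the earlier answer

# A colon is removed only if a ")" follows before any other parenthesis.
cleaned = s.gsub(/:(?=[^()]*\))/, '')

p cleaned  # => "x (abc) : (123) y"
```

The colon between the two groups survives because the lookahead runs into "(" before it can reach a ")".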

Working with Regular Expressions - Repeating Patterns

I am trying to use regular expressions to match some text.
The following pattern is what I am trying to gather.
#Identifier('VariableA', 'VariableB', 'VariableX', ..., 'VariableZ')
I would like to grab a dynamic number of variables rather than a fixed set of two or three.
Is there any way to do this? I have an existing Regular Expression:
\#(\w+)\W+(\w+)\W+(\w+)\W+(\w+)
This captures the Identifier and up to three variables.
Edit: Is it just me, or are regular expressions not as powerful as I'm making them out to be?
You want to use scan for this sort of thing. The basic pattern would be this:
s.scan(/\w+/)
That would give you an array of all the contiguous sequences for word characters:
>> "#Identifier('VariableA', 'VariableB', 'VariableX', 'VariableZ')".scan(/\w+/)
=> ["Identifier", "VariableA", "VariableB", "VariableX", "VariableZ"]
You say you might have multiple instances of your pattern with arbitrary stuff surrounding them. You can deal with that with nested scans:
s.scan(/#(\w+)\(([^)]+?)\)/).map { |m| [ m.first, m.last.scan(/\w+/) ] }
That will give you an array of arrays; each inner array will have the "Identifier" part as the first element and the "Variable" parts as an array in the second element. For example:
>> s = "pancakes #Identifier('VariableA', 'VariableB', 'VariableX', 'VariableZ') pancakes #Pancakes('one','two','three') eggs"
>> s.scan(/#(\w+)\(([^)]+?)\)/).map { |m| [ m.first, m.last.scan(/\w+/) ] }
=> [["Identifier", ["VariableA", "VariableB", "VariableX", "VariableZ"]], ["Pancakes", ["one", "two", "three"]]]
If you might be facing escaped quotes inside your "Variable" bits then you'll need something more complex.
Some notes on the expression:
# # A literal "#".
( # Open a group
\w+ # One or more ("+") word characters ("\w").
) # Close the group.
\( # A literal "(", parentheses are used for grouping so we escape it.
( # Open a group.
[ # Open a character class.
^) # The "^" at the beginning of a [] means "not", the ")" isn't escaped because it doesn't have any special meaning inside a character class.
] # Close a character class.
+? # One or more of the preceding pattern, but don't be greedy.
) # Close the group.
\) # A literal ")".
You don't really need [^)]+? here, just [^)]+ would do but I use the non-greedy forms by habit because that's usually what I mean. The grouping is used to separate the #Identifier and Variable parts so that we can easily get the desired nested array output.
But alex thinks that you meant you wanted to capture the same thing four times. If you want to capture the same pattern, but different things, then you may want to consider two things:
Iteration. In perl, you can say
while ($variable =~ /regex/g) {
the 'g' stands for 'global', and means that each time the regex is called, it matches the /next/ instance.
The other option is recursion. Write your regex like this:
/(what you want)(.*)/
Then, you have backreference 1 containing the first thing, which you can push to an array, and backreference 2 which you'll then recurse over until it no longer matches.
You may use simply (\w+).
Given the input string
#Identifier('VariableA', 'VariableB', 'VariableX', 'VariableZ')
The results would be:
Identifier
VariableA
VariableB
VariableX
VariableZ
This would work for an arbitrary number of variables.
For future reference, it's easy and fun to play around with regexp ideas on Rubular.
So you are asking if there is a way to capture both the identifier and an arbitrary number of variables. I am afraid that you can only do this with regex engines that support captures. Note here that captures and capturing groups are not the one and the same thing. You want to remember all the "variables". This can't be done with simple capturing groups.
I am unaware whether Ruby supports this or not, but I am sure that .NET and the new PERL 6 support it.
In your case you could use two regexes. One to capture the identifier e.g. ^\s*#(\w+)
and another one to capture all variables e.g. result = subject.scan(/'[^']+'/)
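The two-regex idea might look like the sketch below in Ruby. The variable names are illustrative, and the second pattern adds a capturing group (not in the answer above) so that the surrounding quotes are dropped from the results.

```ruby
subject = "#Identifier('VariableA', 'VariableB', 'VariableX', 'VariableZ')"

# First regex: the identifier after the leading "#".
identifier = subject[/^\s*#(\w+)/, 1]

# Second regex: every single-quoted value; the group strips the quotes,
# and flatten turns [["VariableA"], ...] into a flat array.
variables = subject.scan(/'([^']+)'/).flatten

p identifier  # => "Identifier"
p variables   # => ["VariableA", "VariableB", "VariableX", "VariableZ"]
```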

Ruby String: how to match a Regexp from a defined position

I want to match a regexp from a ruby string only from a defined position. Matches before that position do not interest me. Moreover, I'd like \A to match this position.
I found this solution:
code[index..-1][/\A[a-z_][a-zA-Z0-9_]*/]
This matches the regexp at position index in the string code. If the match is not exactly at position index, it returns nil.
Is there a more elegant way to do this (I want to avoid creating the temporary string with the first slice)?
Thanks
You could use ^.{#{index}} inside the regular expression. Don't know if that's what you want, because I don't understand your question completely. Can you maybe add an example with the tested String? And have you heard of Rubular? Great way to test your regular expressions.
This is how you could do it if I understand your question correctly:
code.match(/^.{#{index}}your_regex_here/)
The index variable will be put inside your regular expression. When index = 4, it will check if there's 4 characters from the beginning. Then it will check your own regular expression and only return true if yours is valid as well. I hope it helps. Good luck.
EDIT
And if you want to get the matched value for your regular expression:
code.scan(/^.{#{index}}([a-z_][a-zA-Z0-9_]*)/).join
It puts the matched result (inside the brackets) in an Array and joins it into a String.
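Another option that avoids both the temporary slice and the interpolated prefix is the optional start position of String#match combined with the \G anchor, which matches where the search begins and so plays the role the question wanted from \A. This is a sketch with a made-up input string and index.

```ruby
code  = "x = foo_bar + 1"   # hypothetical input
index = 4                   # position where "foo_bar" starts

ident = /\G[a-z_][a-zA-Z0-9_]*/

hit  = code.match(ident, index)   # search starts at index, \G pins the match there
miss = code.match(ident, 2)       # position 2 holds "=", so nothing matches

p hit[0]  # => "foo_bar"
p miss    # => nil
```

Unlike the ^.{#{index}} prefix, this does not depend on "." matching every character before the position, so newlines earlier in the string should not matter.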

Very odd issue with Ruby and regex

I am getting completely different results from string.scan and several regex testers...
I am just trying to grab the domain from the string, it is the last word.
The regex in question:
/([a-zA-Z0-9\-]*\.)*\w{1,4}$/
The string (1 single line, verified in Ruby's runtime btw)
str = 'Show more results from software.informer.com'
This works fine in the regex testers, but in Ruby...
irb(main):050:0> str.scan /([a-zA-Z0-9\-]*\.)*\w{1,4}$/
=> [["informer."]]
I would think that I would get a match on software.informer.com, which is my goal.
Your regex is correct, the result has to do with the way String#scan behaves. From the official documentation:
"If the pattern contains groups, each individual result is itself an array containing one entry per group."
Basically, if you put parentheses around the whole regex, the first element of each array in your results will be what you expect.
It does not look as if you expect more than one result (especially as the regex is anchored). In that case there is no reason to use scan.
'Show more results from software.informer.com'[ /([a-zA-Z0-9\-]*\.)*\w{1,4}$/ ]
#=> "software.informer.com"
If you do need to use scan (in which case you obviously need to remove the anchor), you can use (?:) to create non-capturing groups.
'foo.bar.baz lala software.informer.com'.scan( /(?:[a-zA-Z0-9\-]*\.)*\w{1,4}/ )
#=> ["foo.bar.baz", "lala", "software.informer.com"]
You are getting a match on software.informer.com. Check the value of $&. The return of scan is an array of the captured groups. Add capturing parentheses around the suffix, and you'll get the .com as part of the return value from scan as well.
The regex testers and Ruby are not disagreeing about the fundamental issue (the regex itself). Rather, their interfaces are differing in what they are emphasizing. When you run scan in irb, the first thing you'll see is the return value from scan (an Array of the captured subpatterns), which is not the same thing as the matched text. Regex testers are most likely oriented toward displaying the matched text.
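The difference between what scan returns and what the regex actually matched can be seen side by side in this sketch, which reuses the string and pattern from the question. The variable names are only illustrative.

```ruby
str = 'Show more results from software.informer.com'

# With a capturing group, scan returns the group's last capture per match.
with_group = str.scan(/([a-zA-Z0-9\-]*\.)*\w{1,4}$/)

# With a non-capturing group, scan returns the whole matched text.
without_group = str.scan(/(?:[a-zA-Z0-9\-]*\.)*\w{1,4}$/)

# MatchData#[] with 0 is the full match, regardless of groups.
full_match = str.match(/([a-zA-Z0-9\-]*\.)*\w{1,4}$/)[0]

p with_group     # => [["informer."]]
p without_group  # => ["software.informer.com"]
p full_match     # => "software.informer.com"
```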
How about doing this :
/([a-zA-Z0-9\-]*\.*\w{1,4})$/
This returns
informer.com
On your test string.
http://rubular.com/regexes/13670

Resources