use of the ampersand here means pre_match? - ruby

What is the ampersand doing in the code below?
s.reverse.gsub( /\d{3}(?=\d)/, '\&,' ).reverse
One would think, after attempting to look up such things, that it is a special variable meaning post_match or pre_match, but the docs say nothing about ampersands - only dollar signs either followed by or preceded by a tick mark.

\& stands for the entire text matched by the regex. See this simplified example:
s = "p1:1 1:1";
print s.gsub( /[a-z]/, '[\&],' ) ## only p is matched
output: [p],1:1 1:1
Similarly, \1 stands for the first capture group matched by the regex (and likewise \2, \3, and so on). An example:
s = "p1:1 1:1";
print s.gsub( /(\d:\d)/, '[\1]' )
output: p[1:1] [1:1]
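To tie this back to the snippet in the question, here is a minimal sketch of the reverse/gsub trick wrapped in a method. The method names add_thousands_separators and bracket_letters are hypothetical, and the first one assumes its input is a string of digits only.

```ruby
# Hypothetical helper wrapping the one-liner from the question.
# Assumes s contains only digits (no sign, no decimal point).
def add_thousands_separators(s)
  # In the reversed string, every run of three digits that is followed by
  # another digit gets a comma appended; '\&' re-inserts the matched text.
  s.reverse.gsub(/\d{3}(?=\d)/, '\&,').reverse
end

# '\0' is another spelling of '\&' in a replacement string.
def bracket_letters(s)
  s.gsub(/[a-z]/, '[\0],')
end

puts add_thousands_separators("1234567")
puts bracket_letters("p1:1 1:1")
```

Note the single quotes around the replacement: inside double quotes the backslash would need to be doubled ("\\&,").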

Related

Convert ruby regular expression definition to python regex

I have the following regexes defined for capturing the gem names in a Gemfile.
GEM_NAME = /[a-zA-Z0-9\-_\.]+/
QUOTED_GEM_NAME = /(?:(?<gq>["'])(?<name>#{GEM_NAME})\k<gq>|%q<(?<name>#{GEM_NAME})>)/
I want to convert these into a regex that can be used in python and other languages.
I tried (?:(["'])([a-zA-Z0-9\-_\.]+)\k["']|%q<([a-zA-Z0-9\-_\.]+)>) based on substitution and several similar combinations but none of them worked. Here's the regexr link http://regexr.com/3g527
Can someone please explain the correct process for converting these Ruby regular expression definitions into a form that can be used by Python?
To define a named group in Python, use (?P<name>...), and refer back to it inside the pattern with the named backreference (?P=name).
If you can afford a 3rd party library, you may use the PyPI regex module and keep the approach you had in Ruby (as regex supports multiple identically named capturing groups):
import regex
s = """%q<Some-name1> "some-name2" 'some-name3'"""
GEM_NAME = r'[a-zA-Z0-9_.-]+'
QUOTED_GEM_NAME = r'(?:(?P<gq>["\'])(?P<name>{0})(?P=gq)|%q<(?P<name>{0})>)'.format(GEM_NAME)
print(QUOTED_GEM_NAME)
# => (?:(?P<gq>["\'])(?P<name>[a-zA-Z0-9_.-]+)(?P=gq)|%q<(?P<name>[a-zA-Z0-9_.-]+)>)
res = [x.group("name") for x in regex.finditer(QUOTED_GEM_NAME, s)]
print(res)
# => ['Some-name1', 'some-name2', 'some-name3']
See this Python demo.
If you decide to go with Python re, it can't handle identically named groups in one regex pattern.
You can discard the named groups altogether and use numbered ones, and use re.finditer to iterate over all the matches with comprehension to grab the right capture.
Example Python code:
import re
GEM_NAME = r'[a-zA-Z0-9_.-]+'
QUOTED_GEM_NAME = r"([\"'])({0})\1|%q<({0})>".format(GEM_NAME)
s = """%q<Some-name1> "some-name2" 'some-name3'"""
matches = [x.group(2) if x.group(1) else x.group(3) for x in re.finditer(QUOTED_GEM_NAME, s)]
print(matches)
# => ['Some-name1', 'some-name2', 'some-name3']
So, ([\"'])({0})\1|%q<({0})> has got 3 capturing groups: if Group 1 matches, the first alternative got matched, thus, Group 2 is taken, else, the second alternative matched, and Group 3 value is grabbed in the comprehension.
Pattern details
([\"']) - Group 1: a " or '
({0}) - Group 2: GEM_NAME pattern
\1 - inline backreference to the Group 1 captured value (note that r'...' raw string literal allows using a single backslash to define a backreference in the string literal)
| - or
%q< - a literal substring
({0}) - Group 3: GEM_NAME pattern
> - a literal >.
You can rewrite your pattern like this:
GEM_NAME = r'[a-zA-Z0-9_.-]+'
QUOTED_GEM_NAME = r'''(?x) # verbose mode; an inline global flag must come first (an error elsewhere since Python 3.11)
["'%] # first possible character
(?:(?<=%)q<)? # if preceded by a % match "q<"
(?P<name> # the three possibilities excluding the delimiters
(?<=") {0} (?=") |
(?<=') {0} (?=') |
(?<=<) {0} (?=>)
)
["'>] #'"# closing delimiter
'''.format(GEM_NAME)
Advantages:
the pattern doesn't start with an alternation, which would make the search slow. (The alternation here is only tested at interesting positions, after a quote or a %, whereas your version tests each branch of the alternation at every position in the string.) This optimisation technique is called "first character discrimination" and consists of quickly discarding useless positions in a string.
you need only one capture group occurrence (quotes and angle brackets are excluded from it and only tested with lookarounds). This way you can use re.findall to get a list of gems without further manipulation.
the gq group wasn't useful and was removed (shortening a pattern at the cost of creating a useless capture group isn't a good idea)
Note that you don't need to escape the dot inside a character class.
A simple way is to use a conditional and consolidate the name.
(?:(?:(["'])|%q<)(?P<name>[a-zA-Z0-9\-_\.]+)(?(1)\1|>))
Expanded
(?:
(?: # Delimiters
( ["'] ) # (1), ' or "
| # or,
%q< # %q
)
(?P<name> [a-zA-Z0-9\-_\.]+ ) # (2), Name
(?(1) \1 | > ) # Did group 1 match ? match it here, else >
)
Python
import re
s = ' "asdf" %q<asdfasdf> '
print ( re.findall( r'(?:(?:(["\'])|%q<)(?P<name>[a-zA-Z0-9\-_\.]+)(?(1)\1|>))', s ) )
Output
[('"', 'asdf'), ('', 'asdfasdf')]

Regex matching chars around text

I have a string with chars inside and I would like to match only the chars around a string.
"This is a [1]test[/1] string. And [2]test[/2]"
Rubular http://rubular.com/r/f2Xwe3zPzo
Currently, the code in the link matches the text inside the special chars, how can I change it?
Update
To clarify my question. It should only match if the opening and closing has the same number.
"[2]first[/2] [1]second[/2]"
In the code above, only first should match and not second. The text inside the special chars (first), should be ignored.
Try this:
(\[[0-9]\]).+?(\[\/[0-9]\])
Permalink to the example on Rubular.
Update
Since you want to remove the 'special' characters, try this instead:
foo = "This is a [1]test[/1] string. And [2]test[/2]"
foo.gsub /\[\/?\d\]/, ""
# => "This is a test string. And test"
Update, Part II
You only want to remove the 'special' characters when the surrounding tags match, so what about this:
foo = "This is a [1]test[/1] string. And [2]test[/2], but not [3]test[/2]"
foo.gsub /(?:\[(?<number>\d)\])(?<content>.+?)(?:\[\/\k<number>\])/, '\k<content>'
# => "This is a test string. And test, but not [3]test[/2]"
\[([0-9])\].+?\[\/\1\]
([0-9]) is a capture since it is surrounded with parentheses. The \1 tells it to use the result of that capture. If you had more than one capture, you could reference them as well, \2, \3, etc.
Rubular
You can also use a named capture, rather than \1 to make it a little less cryptic. As in: \[(?<number>[0-9])\].+?\[\/\k<number>\]
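As a rough sketch of how the named-capture version could be used to strip only matching tag pairs, the following combines \k<number> with the block form of gsub. The constant TAG_PAIR and the method strip_matching_tags are hypothetical names, and \d+ is an assumption that tag numbers may have more than one digit.

```ruby
# Hypothetical pattern: an opening tag [n], lazily matched content, and a
# closing tag [/n] whose number must equal the opening one (\k<number>).
TAG_PAIR = /\[(?<number>\d+)\](?<content>.+?)\[\/\k<number>\]/

# Replaces each matching tag pair with just the text between the tags.
# Mismatched pairs such as "[3]test[/2]" are left untouched.
def strip_matching_tags(text)
  text.gsub(TAG_PAIR) { Regexp.last_match[:content] }
end

puts strip_matching_tags("This is a [1]test[/1] string. And [2]test[/2], but not [3]test[/2]")
```

Using the block form avoids having to remember the escaping rules for \k<content> in a replacement string.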
Here's a way to do it that uses the form of String#gsub that takes a block. The idea is to pull strings such as "[1]test[/1]" into the block, and there remove the unwanted bits.
str = "This is a [1]test[/1] string. And [2]test[/2], plus [3]test[/99]"
r = /
\[ # match a left bracket
(\d+) # capture one or more digits in capture group 1
\] # match a right bracket
.+? # match one or more characters lazily
\[\/ # match a left bracket and forward slash
\1 # match the contents of capture group 1
\] # match a right bracket
/x
str.gsub(r) { |s| s[/(?<=\]).*?(?=\[)/] }
#=> "This is a test string. And test, plus [3]test[/99]"
Aside: When I first heard of named capture groups, they seemed like a great idea, but now I wonder if they really make regexes easier to read than \1, \2....

Ruby Regex gsub - everything after string

I have a string something like:
test:awesome my search term with spaces
And I'd like to extract the string immediately after test: into one variable and everything else into another, so I'd end up with awesome in one variable and my search term with spaces in another.
Logically, what I'd do is move everything matching test:* into another variable, and then remove everything before the first :, leaving me with what I wanted.
At the moment I'm using /test:(.*)([\s]+)/ to match the first part, but I can't seem to get the second part correctly.
The first capture in your regular expression is greedy, and it matches spaces because you used a dot (.), which matches any character. Instead try:
matches = string.match(/test:(\S*) (.*)/)
# index 0 is the whole pattern that was matched
first = matches[1] # this is the first () group
second = matches[2] # and the second () group
Use the following:
/^test:(.*?) (.*)$/
That is, match "test:", then a series of characters (non-greedily), up to a single space, and another series of characters to the end of the line.
I am guessing you want to remove all the leading spaces before the second match too, hence I have \s+ in the expression. Otherwise, remove the \s+ from the expression, and you'll have what you want:
m = /^test:(\w+)\s+(.*)/.match("test:awesome my search term with spaces")
a = m[1]
b = m[2]
http://codepad.org/JzuNQxBN
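The answers above can be pulled together into one small sketch that uses named captures and handles lines that do not match. The method name split_tagged_query is hypothetical, and it assumes the input is a single line with no trailing newline.

```ruby
# Hypothetical helper: splits "test:<tag> <rest>" into [tag, rest].
# Returns nil when the line does not have that shape.
def split_tagged_query(line)
  # \S+ stops at the first whitespace, \s+ swallows the separator,
  # and .* takes everything up to the end of the string.
  m = line.match(/\Atest:(?<tag>\S+)\s+(?<rest>.*)\z/)
  m && [m[:tag], m[:rest]]
end

tag, rest = split_tagged_query("test:awesome my search term with spaces")
puts tag
puts rest
```

Returning nil for non-matching input keeps the caller from calling [] on a failed match.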

Match consecutive list of exactly one character in set with regular expressions

I don't think I'll even try to explain this, I don't know the words to, but I'd like to achieve the following:
Given a string like this:
+++>><<<--
I'd like a match to give me: +++, but also match if any of the other characters were in the string consecutively like they are. So if the +++ wasn't there, I'd like to match >>.
I tried using the following regular expression:
([><\-\+]+)
However, given the string above, it would match the entire string, and not the first list of consecutive characters.
If it makes a difference, this is in Ruby (1.9.3).
Not sure about the ruby bit, but you can do this with backreferences in the pattern:
(.)\1+
This uses a capturing group () to capture any character (.), followed by one or more (+) repetitions of that same character (\1). The \1 is a backreference to the first captured group; in a pattern with more capturing groups, \2 would be the second captured group, and so on.
Java Example
Pattern p = Pattern.compile("(.)\\1+");
Matcher m = p.matcher("aaabbccaa");
m.find();
System.out.println(m.group(0)); // prints "aaa"
Ruby Example
# Return an array of matched patterns.
string = '+++>><<<--'
string.scan( /((.)\2+)/ ).collect { |match| match.first }
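If only the characters from the question's set should count, the backreference idea can be restricted to that set. The following is a sketch under that assumption; the constant and method names are hypothetical, and, like the examples above, it only finds runs of two or more characters.

```ruby
# Hypothetical patterns limited to the characters > < - + from the question.
FIRST_RUN = /([><\-+])\1+/    # one character from the set, then repeats of it
EACH_RUN  = /(([><\-+])\2+)/  # same, with an outer group holding the whole run

# Returns the first run of a repeated character, or nil if there is none.
def first_run(string)
  string[FIRST_RUN]
end

# Returns every run; scan yields [whole_run, single_char] pairs.
def all_runs(string)
  string.scan(EACH_RUN).map(&:first)
end

puts first_run("+++>><<<--")
p all_runs("+++>><<<--")
```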

How to remove the first 4 characters from a string if it matches a pattern in Ruby

I have the following string:
"h3. My Title Goes Here"
I basically want to remove the first four characters from the string so that I just get back:
"My Title Goes Here".
The thing is I am iterating over an array of strings and not all have the h3. part in front so I can't just ditch the first four characters blindly.
I checked the docs and the closest thing I could find was chomp, but that only works for the end of a string.
Right now I am doing this:
"h3. My Title Goes Here".reverse.chomp(" .3h").reverse
This gives me my desired output, but there has to be a better way. I don't want to reverse a string twice for no reason. Is there another method that will work?
To alter the original string, use sub!, e.g.:
my_strings = [ "h3. My Title Goes Here", "No h3. at the start of this line" ]
my_strings.each { |s| s.sub!(/^h3\. /, '') }
To leave the original unaltered and only return the result, drop the exclamation point, i.e. use sub. In the general case, where a pattern can match more than once and you want every match replaced, use gsub! and gsub; without the g only the first match is replaced (which is what you want here). Note that in Ruby ^ matches at the start of every line, so use \A if the prefix must be at the very start of the string.
You can use sub with a regular expression:
s = 'h3. foo'
s.sub!(/^h[0-9]+\. /, '')
puts s
Output:
foo
The regular expression should be understood as follows:
^ Match at the start of a line (in Ruby, \A anchors to the start of the whole string).
h A literal "h".
[0-9] A digit from 0-9.
+ One or more of the previous (i.e. one or more digits)
\. A literal period.
A space (yes, spaces are significant by default in regular expressions!)
You can modify the regular expression to suit your needs. See a regular expression tutorial or syntax guide.
A standard approach would be to use regular expressions:
"h3. My Title Goes Here".gsub /^h3\. /, '' #=> "My Title Goes Here"
gsub means globally substitute and it replaces a pattern by a string, in this case an empty string.
The regular expression is enclosed in / and consists of:
^ means beginning of a line (for a single-line string, the beginning of the string)
h3 is matched literally, so it means h3
\. - a dot normally means any character so we escape it with a backslash
a trailing space, which is matched literally
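The approaches above can be compared side by side in a short sketch. The method name strip_heading_prefix is hypothetical. It anchors with \A rather than ^ because in Ruby ^ also matches after a newline, and it assumes the prefix is "h", one or more digits, a period, and a space.

```ruby
# Hypothetical helper: removes a leading "h<digits>. " if present and
# returns a new string; strings without the prefix come back unchanged.
def strip_heading_prefix(line)
  line.sub(/\Ah\d+\. /, '')
end

titles = ["h3. My Title Goes Here", "No h3. at the start of this line"]
puts titles.map { |t| strip_heading_prefix(t) }

# For a fixed prefix, String#delete_prefix (Ruby 2.5+) needs no regex.
puts "h3. My Title Goes Here".delete_prefix("h3. ")

# ^ matches at the start of any line, so it also strips a prefix that
# begins a later line; \A does not.
multi = "intro\nh3. Title"
p multi.sub(/^h3\. /, '')
p strip_heading_prefix(multi)
```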

Resources