Convert ruby regular expression definition to python regex - ruby

I have the following regexes defined for capturing gem names in a Gemfile.
GEM_NAME = /[a-zA-Z0-9\-_\.]+/
QUOTED_GEM_NAME = /(?:(?<gq>["'])(?<name>#{GEM_NAME})\k<gq>|%q<(?<name>#{GEM_NAME})>)/
I want to convert these into a regex that can be used in python and other languages.
I tried (?:(["'])([a-zA-Z0-9\-_\.]+)\k["']|%q<([a-zA-Z0-9\-_\.]+)>) based on substitution and several similar combinations but none of them worked. Here's the regexr link http://regexr.com/3g527
Can someone please explain the correct process for converting these Ruby regular expression definitions into a form that Python can use?

To define a named group in Python, you need to use (?P<name>...), and then refer to it inside the pattern with the (?P=name) named backreference.
If you can afford a third-party library, you can use the PyPI regex module and keep the approach you had in Ruby, since regex supports multiple identically named capturing groups:
import regex
s = """%q<Some-name1> "some-name2" 'some-name3'"""
GEM_NAME = r'[a-zA-Z0-9_.-]+'
QUOTED_GEM_NAME = r'(?:(?P<gq>["\'])(?P<name>{0})(?P=gq)|%q<(?P<name>{0})>)'.format(GEM_NAME)
print(QUOTED_GEM_NAME)
# => (?:(?P<gq>["\'])(?P<name>[a-zA-Z0-9_.-]+)(?P=gq)|%q<(?P<name>[a-zA-Z0-9_.-]+)>)
res = [x.group("name") for x in regex.finditer(QUOTED_GEM_NAME, s)]
print(res)
# => ['Some-name1', 'some-name2', 'some-name3']
If you decide to go with Python's built-in re module, note that it can't handle identically named groups in one regex pattern.
You can discard the named groups altogether, use numbered ones, and iterate over all the matches with re.finditer, using a list comprehension to grab the right capture.
Example Python code:
import re
GEM_NAME = r'[a-zA-Z0-9_.-]+'
QUOTED_GEM_NAME = r"([\"'])({0})\1|%q<({0})>".format(GEM_NAME)
s = """%q<Some-name1> "some-name2" 'some-name3'"""
matches = [x.group(2) if x.group(1) else x.group(3) for x in re.finditer(QUOTED_GEM_NAME, s)]
print(matches)
# => ['Some-name1', 'some-name2', 'some-name3']
So, ([\"'])({0})\1|%q<({0})> has three capturing groups. If Group 1 matched, the first alternative matched, so Group 2 is taken; otherwise the second alternative matched, and the comprehension grabs the Group 3 value.
Pattern details
([\"']) - Group 1: a " or '
({0}) - Group 2: GEM_NAME pattern
\1 - inline backreference to the Group 1 captured value (note that r'...' raw string literal allows using a single backslash to define a backreference in the string literal)
| - or
%q< - a literal substring
({0}) - Group 3: GEM_NAME pattern
> - a literal >.
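If you would rather keep named groups with the standard re module, one option is to give the two name groups distinct names and merge them afterwards. This is only a minimal sketch: the group names name_quoted and name_percent are invented for illustration and do not appear in the original Ruby pattern.

```python
import re

GEM_NAME = r'[a-zA-Z0-9_.-]+'
# Hypothetical group names: re requires every named group to be unique,
# so each alternative gets its own name.
QUOTED_GEM_NAME = (
    r"(?P<gq>[\"'])(?P<name_quoted>{0})(?P=gq)"
    r"|%q<(?P<name_percent>{0})>"
).format(GEM_NAME)

s = """%q<Some-name1> "some-name2" 'some-name3'"""
# The group of the alternative that did not match is None,
# so "or" picks whichever one is set.
names = [
    m.group("name_quoted") or m.group("name_percent")
    for m in re.finditer(QUOTED_GEM_NAME, s)
]
print(names)
```

The trade-off compared with numbered groups is a slightly longer pattern in exchange for captures that are easier to read at the call site.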

You can rewrite your pattern like this:
GEM_NAME = r'[a-zA-Z0-9_.-]+'
QUOTED_GEM_NAME = r'''(?x) # verbose mode for the whole pattern; Python 3.11+ requires global flags at the start
["'%] # first possible character
(?:(?<=%)q<)? # if preceded by a % match "q<"
(?P<name> # the three possibilities excluding the delimiters
(?<=") {0} (?=") |
(?<=') {0} (?=') |
(?<=<) {0} (?=>)
)
["'>] #'"# closing delimiter
'''.format(GEM_NAME)
Advantages:
the pattern doesn't start with an alternation, which would make the search slow. (Here the alternation is only tested at interesting positions, after a quote or a %, whereas your version tests each branch of the alternation at every position in the string.) This optimisation technique is called "first character discrimination" and consists in quickly discarding useless positions in a string.
you need only one capture group occurrence (quotes and angle brackets are excluded from it and only tested with lookarounds). This way you can use re.findall to get a list of gems without further manipulation.
the gq group wasn't useful and was removed (shortening a pattern at the cost of creating a useless capture group isn't a good idea)
Note that you don't need to escape the dot inside a character class.
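To illustrate the re.findall point, here is a small sketch that runs the same pattern structure against the sample string used elsewhere in this thread. It assumes verbose mode is switched on with the re.VERBOSE flag rather than an inline (?x); the sample string is borrowed from the other answer, not from this one.

```python
import re

GEM_NAME = r'[a-zA-Z0-9_.-]+'
# Same structure as the pattern above; verbose mode comes from the
# re.VERBOSE flag (an assumption of this sketch) instead of inline (?x).
QUOTED_GEM_NAME = re.compile(r'''
    ["'%]                 # first possible character
    (?:(?<=%)q<)?         # after a percent sign, also consume q<
    (?P<name>             # the name itself, without delimiters
        (?<=") {0} (?=") |
        (?<=') {0} (?=') |
        (?<=<) {0} (?=>)
    )
    ["'>]                 # closing delimiter
'''.format(GEM_NAME), re.VERBOSE)

s = """%q<Some-name1> "some-name2" 'some-name3'"""
# With a single capture group, findall returns a flat list of strings.
names = QUOTED_GEM_NAME.findall(s)
print(names)
```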

A simple way is to use a conditional and consolidate the name.
(?:(?:(["'])|%q<)(?P<name>[a-zA-Z0-9\-_\.]+)(?(1)\1|>))
Expanded
(?:
(?: # Delimiters
( ["'] ) # (1), ' or "
| # or,
%q< # %q<
)
(?P<name> [a-zA-Z0-9\-_\.]+ ) # (2), Name
(?(1) \1 | > ) # Did group 1 match ? match it here, else >
)
Python
import re
s = ' "asdf" %q<asdfasdf> '
print ( re.findall( r'(?:(?:(["\'])|%q<)(?P<name>[a-zA-Z0-9\-_\.]+)(?(1)\1|>))', s ) )
Output
[('"', 'asdf'), ('', 'asdfasdf')]
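Because re.findall returns one tuple per match when the pattern has more than one capture group, the delimiter shows up next to each name. If only the names are wanted, a possible variation (a sketch, not part of the answer above) is to iterate with finditer and read the named group:

```python
import re

# Same conditional pattern, with the character class written without
# the unnecessary escapes.
pattern = re.compile(r'''(?:(["'])|%q<)(?P<name>[a-zA-Z0-9_.-]+)(?(1)\1|>)''')

s = ' "asdf" %q<asdfasdf> '
# Pull only the "name" group out of each match object.
names = [m.group("name") for m in pattern.finditer(s)]
print(names)
```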

Related

Ruby regex avoid matching a group

I have this code running inside a buffer (used to unescape a JS string in Ruby):
elsif hex_substring =~ /^\\u[0-9a-fA-F]{1,4}/
hex_substring.scan(/^((\\u[\da-fA-F]{4}){1,})/) do |match|
hex_byte = match[0]
buffer << JSON.load(%Q("#{hex_byte}"))
hex_index += hex_byte.length
end
...
I have a concern that the scan() is matching a bit too much:
hex_substring.scan(/^((\\u[\da-fA-F]{4}){1,})/)
# => [["\\ud83c\\udfec", "\\udfec"]]
I am using only "\\ud83c\\udfec", not "\\udfec".
Is there a way in Ruby or in regex to grab only the first part?
You should use a single grouping construct here, the one to match 1 or more occurrences of four hex chars, and omit the inner capturing group that resulted in an extra item in the resulting array:
.scan(/^(?:\\u[\da-fA-F]{4})+/)
Note that + is a simpler and shorter way to write {1,} (one or more occurrences).
Details
^ - start of string
(?: - start of a non-capturing group (what it matches won't be added to the final scan result):
\\u - a \u substring
[\da-fA-F]{4} - four hex chars
)+ - 1 or more occurrences (of the group pattern sequence).
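The same effect can be observed outside Ruby. Python's re.findall treats capture groups much like scan does, so the following sketch (in Python only for illustration; the sample input is made up) shows the extra item disappearing once the groups become non-capturing:

```python
import re

# Raw string: a literal backslash followed by "u", as in the escaped JS text.
hex_substring = r'\ud83c\udfec and more'

# Two capturing groups: each match is reported as a tuple of both groups.
with_captures = re.findall(r'^((\\u[\da-fA-F]{4}){1,})', hex_substring)
# One non-capturing group: each match is reported as the whole matched text.
without_captures = re.findall(r'^(?:\\u[\da-fA-F]{4})+', hex_substring)

print(with_captures)
print(without_captures)
```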

Regular Expression replacement to convert Less mixins to Scss

I'm looking to convert Less mixin calls to their equivalents in Scss:
.mixin(); should become #mixin();
.mixin(0); should become #mixin(0);
.mixin(0; 1; 2); should become #mixin(0, 1, 2);
I'm having the most difficulty with the third example, as I essentially need to match n groups separated by semicolons, and replace those with the same groups separated by commas. I suppose this relies on some sort of repeating groups functionality in regexes that I'm not familiar with.
It's not enough to simply replace semicolons within parens - I need a regex that will only match the \.[\w\-]+\(.*\) format of mixins, but obviously with some magic in the second match group to handle the 3rd example above.
I'm doing this in Ruby, so if you're able to provide replacement syntax that's compatible with gsub, that would be awesome. I would like a single regex replacement, something that doesn't require multiple passes to clean up the semicolons.
I suggest adding two capturing groups around the subvalues you need and using an additional gsub inside the block of the first gsub to replace ; with , only in the 2nd group:
s = ".mixin(0; 1; 2);"
puts s.gsub(/\.([\w\-]+)(\(.*\))/) { "##{$1}#{$2.gsub(/;/, ',')}" }
# => #mixin(0, 1, 2);
The pattern details:
\. - a literal dot
([\w\-]+) - Group 1 capturing 1 or more word chars ([a-zA-Z0-9_]) or -
(\(.*\)) - Group 2 capturing a (, then any 0+ chars other than linebreak symbols as many as possible up to the last ) and the last ). NOTE: if there are multiple values, use lazy matching - (\(.*?\)) - here.
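For readers porting this outside Ruby, the nested-replacement idea can be sketched in Python with a callable passed to re.sub. The names MIXIN_CALL, less_to_scss and convert_call are made up for this example.

```python
import re

# Same pattern as above: group 1 is the mixin name, group 2 the argument list.
MIXIN_CALL = re.compile(r'\.([\w\-]+)(\(.*\))')

def less_to_scss(line):
    """Rewrite Less mixin calls in one line to the #name(a, b) form."""
    def convert_call(m):
        # Only the text captured by group 2 has its semicolons replaced.
        return '#' + m.group(1) + m.group(2).replace(';', ',')
    return MIXIN_CALL.sub(convert_call, line)

print(less_to_scss('.mixin(0; 1; 2);'))
```

The trailing semicolon is left alone because it sits outside group 2.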
Here you go:
less_style = ".mixin(0; 1; 2);"
# convert the first period to #
less_style.gsub! /^\./, '#'
# convert the inner semicolons to commas
scss_style = less_style.gsub /(?<=[\(\d]);/, ','
scss_style
# => "#mixin(0, 1, 2);"
The second regex is using positive lookbehinds. You can read about those here: http://www.regular-expressions.info/lookaround.html
I also use this neat web app to play around with regexes: http://rubular.com/
This will get you a single pass through gsub:
".mixin(0; 1; 2);".gsub(/(?<!\));|\./, ";" => ",", "." => "#")
# => "#mixin(0, 1, 2);"
It's an OR regex with a hash for the replacement parameters.
Judging from your example, you only want to replace semicolons that do not follow a closing paren, hence the negative lookbehind: (?<!\));
You can modify/build on this with other expressions. Even add more OR conditions to the regex.
Also, you can use the block version of gsub if you need more options.
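The hash-replacement trick has a rough Python counterpart: a dictionary lookup inside a replacement function. This is only a sketch of the same idea, and the name REPLACEMENTS is invented here.

```python
import re

# Maps each possible match to its replacement, like the Ruby hash.
REPLACEMENTS = {';': ',', '.': '#'}

s = '.mixin(0; 1; 2);'
# One pass: a semicolon not preceded by ")" or any dot is looked up in the dict.
converted = re.sub(r'(?<!\));|\.', lambda m: REPLACEMENTS[m.group(0)], s)
print(converted)
```

Note that, like the Ruby original, this replaces every dot in the input, so a value such as 0.5 inside the arguments would also be affected.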

use of the ampersand here means pre_match?

What is the ampersand doing in the code below?
s.reverse.gsub( /\d{3}(?=\d)/, '\&,' ).reverse
One would think, after attempting to look up such things, that it is a special variable meaning post_match or pre_match, but the docs say nothing about ampersands - only dollar signs either followed by or preceded by a tick mark.
\& refers to the whole string matched by the regex. See this simplified example:
s = "p1:1 1:1";
print s.gsub( /[a-z]/, '[\&],' ) ## only p is matched
output: [p],1:1 1:1
Similarly, \1 refers to the first group captured by the regex (and likewise \2, \3, and so on). An example:
s = "p1:1 1:1";
print s.gsub( /(\d:\d)/, '[\1]' )
output: p[1:1] [1:1]
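For comparison, Python spells the whole-match reference as \g<0> in the replacement string. The sketch below redoes the thousands-separator line from the question in Python; the function name add_thousands_separators is made up for the example.

```python
import re

def add_thousands_separators(digits):
    """Insert commas every three digits, counting from the right."""
    reversed_digits = digits[::-1]
    # \g<0> is the whole match, the counterpart of Ruby's \& here.
    grouped = re.sub(r'\d{3}(?=\d)', r'\g<0>,', reversed_digits)
    return grouped[::-1]

print(add_thousands_separators('1234567'))
```

Reversing first lets the regex count groups of three from the least significant digit, and the lookahead prevents a trailing comma after the last group.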

Match consecutive list of exactly one character in set with regular expressions

I don't think I'll even try to explain this, I don't know the words to, but I'd like to achieve the following:
Given a string like this:
+++>><<<--
I'd like a match to give me: +++, but also match if any of the other characters were in the string consecutively like they are. So if the +++ wasn't there, I'd like to match >>.
I tried using the following regular expression:
([><\-\+]+)
However, given the string above, it would match the entire string, and not the first list of consecutive characters.
If it makes a difference, this is in Ruby (1.9.3).
Not sure about the ruby bit, but you can do this with backreferences in the pattern:
(.)\1+
What this does is use a capturing group () to capture any character . followed by one or more + repetitions of the same character \1. The \1 is a backreference to the first captured group; in a pattern with more capturing groups, \2 would be the second captured group, and so on.
Java Example
Pattern p = Pattern.compile("(.)\\1+");
Matcher m = p.matcher("aaabbccaa");
m.find();
System.out.println(m.group(0)); // prints "aaa"
Ruby Example
# Return an array of matched patterns.
string = '+++>><<<--'
string.scan( /((.)\2+)/ ).collect { |match| match.first }
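A Python version of the same backreference idea, as a sketch. Note one deliberate difference, which is an assumption about what the asker wants: the second pattern uses \1* instead of \1+ so that a run of length one also counts.

```python
import re

s = '+++>><<<--'

# First run of a repeated character (two or more of the same).
first_run = re.search(r'(.)\1+', s).group(0)

# Every run, including runs of length one, thanks to * instead of +.
all_runs = [m.group(0) for m in re.finditer(r'(.)\1*', s)]

print(first_run)
print(all_runs)
```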

Ruby Regex match unless escaped with \

Using Ruby I'm trying to split the following text with a Regex
~foo\~\=bar =cheese~monkey
Where ~ or = denotes the beginning of match unless it is escaped with \
So it should match
~foo\~\=bar
then
=cheese
then
~monkey
I thought the following would work, but it doesn't.
([~=]([^~=]|\\=|\\~)+)(.*)
What is a better regex expression to use?
edit To be more specific, the above regex matches all occurrences of = and ~
edit Working solution. Here is what I came up with to solve the issue. I found that Ruby 1.8 has look ahead, but doesn't have lookbehind functionality. So after looking around a bit, I came across this post in comp.lang.ruby and completed it with the following:
# Iterates through the answer clauses
def split_apart clauses
reg = Regexp.new('.*?(?:[~=])(?!\\\\)', Regexp::MULTILINE)
# need to use reverse since Ruby 1.8 has look ahead, but not look behind
matches = clauses.reverse.scan(reg).reverse.map {|clause| clause.strip.reverse}
matches.each do |match|
yield match
end
end
What does "remove the head" mean in this context?
If you want to remove everything before a certain char, this will do:
.*?(?<!\\)= // anything up to the first "=" that is not preceded by "\"
.*?(?<!\\)~ // same, but for the squiggly "~"
.*?(?<!\\)(?=~) // same, but excluding the separator itself (if you need that)
Replace by "", repeat, done.
If your string has exactly three elements ("1=2~3") and you want to match all of them at once, you can use:
^(.*?(?<!\\)(?:=))(.*?(?<!\\)(?:~))(.*)$
matches: \~foo\~\=bar =cheese~monkey
group 1: "\~foo\~\=bar =", group 2: "cheese~", group 3: "monkey"
Alternatively, you split the string using this regex:
(?<!\\)[=~]
returns: ['\~foo\~\=bar ', 'cheese', 'monkey'] for "\~foo\~\=bar =cheese~monkey"
returns: ['', 'foo\~\=bar ', 'cheese', 'monkey'] for "~foo\~\=bar =cheese~monkey"
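The split approach can be tried in any engine with lookbehind support; the sketch below uses Python for illustration. The second pattern is not from this thread: it is one possible way to keep the leading ~ or = with each piece, as the question asked, without needing lookbehind.

```python
import re

text = r'~foo\~\=bar =cheese~monkey'

# Split on = or ~ unless the separator is preceded by a backslash.
parts = re.split(r'(?<!\\)[=~]', text)

# Alternative sketch: match a separator, then any run of escaped
# characters or characters that are neither a separator nor a backslash.
clauses = re.findall(r'[~=](?:\\.|[^~=\\])*', text)

print(parts)
print(clauses)
```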

Resources