Ruby Regex match unless escaped with \ - ruby

Using Ruby I'm trying to split the following text with a Regex
~foo\~\=bar =cheese~monkey
Where ~ or = denotes the beginning of match unless it is escaped with \
So it should match
~foo\~\=bar
then
=cheese
then
~monkey
I thought the following would work, but it doesn't.
([~=]([^~=]|\\=|\\~)+)(.*)
What is a better regex expression to use?
Edit: To be more specific, the above regex matches all occurrences of = and ~
Edit: Working solution. Here is what I came up with to solve the issue. I found that Ruby 1.8 has lookahead but no lookbehind. So after looking around a bit, I came across this post in comp.lang.ruby and completed it with the following:
# Iterates through the answer clauses
def split_apart(clauses)
  reg = Regexp.new('.*?(?:[~=])(?!\\\\)', Regexp::MULTILINE)
  # need to use reverse since Ruby 1.8 has lookahead, but not lookbehind
  matches = clauses.reverse.scan(reg).reverse.map { |clause| clause.strip.reverse }
  matches.each do |match|
    yield match
  end
end

What does "remove the head" mean in this context?
If you want to remove everything before a certain char, this will do:
.*?(?<!\\)= // anything up to the first "=" that is not preceded by "\"
.*?(?<!\\)~ // same, but for the squiggly "~"
.*?(?<!\\)(?=~) // same, but excluding the separator itself (if you need that)
Replace by "", repeat, done.
If your string has exactly three elements ("1=2~3") and you want to match all of them at once, you can use:
^(.*?(?<!\\)(?:=))(.*?(?<!\\)(?:~))(.*)$
matches for \~foo\~\=bar =cheese~monkey:
group 1: \~foo\~\=bar =
group 2: cheese~
group 3: monkey
Alternatively, you split the string using this regex:
(?<!\\)[=~]
returns: ['\~foo\~\=bar ', 'cheese', 'monkey'] for "\~foo\~\=bar =cheese~monkey"
returns: ['', 'foo\~\=bar ', 'cheese', 'monkey'] for "~foo\~\=bar =cheese~monkey"
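If the goal is the one from the question, keeping each clause together with its leading ~ or =, a scan that treats a backslash plus the following character as a single unit avoids lookbehind altogether. This is a minimal sketch rather than the method from the answers above; it assumes the text begins with a separator, as in the question, and the variable names are illustrative.

```ruby
text = '~foo\~\=bar =cheese~monkey'

# A clause starts at ~ or = and then consumes either an escaped pair
# (backslash + any character) or any character that is not ~, = or \.
clauses = text.scan(/[~=](?:\\.|[^~=\\])*/).map(&:strip)
clauses.each { |clause| puts clause }

# With lookbehind (Ruby 1.9 and later), splitting on unescaped separators
# drops the separators themselves and leaves an empty leading element.
parts = text.split(/(?<!\\)[=~]/)
p parts
```

The scan keeps the separators, which the split does not, so it is the closer fit when you need to know whether a clause began with ~ or =.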


Convert ruby regular expression definition to python regex

I have the following regexes defined for capturing the gem names in a Gemfile.
GEM_NAME = /[a-zA-Z0-9\-_\.]+/
QUOTED_GEM_NAME = /(?:(?<gq>["'])(?<name>#{GEM_NAME})\k<gq>|%q<(?<name>#{GEM_NAME})>)/
I want to convert these into a regex that can be used in Python and other languages.
I tried (?:(["'])([a-zA-Z0-9\-_\.]+)\k["']|%q<([a-zA-Z0-9\-_\.]+)>) based on substitution, and several similar combinations, but none of them worked. Here's the regexr link: http://regexr.com/3g527
Can someone please explain the correct process for converting these Ruby regular expression definitions into a form that Python can use?
To define a named group in Python, you need to use (?P<name>...), and then a (?P=name) named backreference to refer to it inside the pattern.
If you can afford a 3rd party library, you may use the PyPI regex module and keep the approach you had in Ruby (regex supports multiple identically named capturing groups):
import regex
s = """%q<Some-name1> "some-name2" 'some-name3'"""
GEM_NAME = r'[a-zA-Z0-9_.-]+'
QUOTED_GEM_NAME = r'(?:(?P<gq>["\'])(?<name>{0})(?P=gq)|%q<(?P<name>{0})>)'.format(GEM_NAME)
print(QUOTED_GEM_NAME)
# => (?:(?P<gq>["\'])(?<name>[a-zA-Z0-9_.-]+)(?P=gq)|%q<(?P<name>[a-zA-Z0-9_.-]+)>)
res = [x.group("name") for x in regex.finditer(QUOTED_GEM_NAME, s)]
print(res)
# => ['Some-name1', 'some-name2', 'some-name3']
If you decide to go with Python re, it can't handle identically named groups in one regex pattern.
You can discard the named groups altogether and use numbered ones, and use re.finditer to iterate over all the matches with comprehension to grab the right capture.
Example Python code:
import re
GEM_NAME = r'[a-zA-Z0-9_.-]+'
QUOTED_GEM_NAME = r"([\"'])({0})\1|%q<({0})>".format(GEM_NAME)
s = """%q<Some-name1> "some-name2" 'some-name3'"""
matches = [x.group(2) if x.group(1) else x.group(3) for x in re.finditer(QUOTED_GEM_NAME, s)]
print(matches)
# => ['Some-name1', 'some-name2', 'some-name3']
So, ([\"'])({0})\1|%q<({0})> has got 3 capturing groups: if Group 1 matches, the first alternative got matched, thus, Group 2 is taken, else, the second alternative matched, and Group 3 value is grabbed in the comprehension.
Pattern details
([\"']) - Group 1: a " or '
({0}) - Group 2: GEM_NAME pattern
\1 - inline backreference to the Group 1 captured value (note that r'...' raw string literal allows using a single backslash to define a backreference in the string literal)
| - or
%q< - a literal substring
({0}) - Group 3: GEM_NAME pattern
> - a literal >.
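For comparison, the numbered-group version carries back to Ruby almost unchanged, which can help when the same pattern has to be maintained in both languages. This is a sketch that assumes the sample string from the Python example; the lowercase variable names are illustrative. String#scan returns one array of captures per match, so the name is whichever of groups 2 and 3 took part.

```ruby
gem_name = /[a-zA-Z0-9_.-]+/
# Group 1: opening quote, group 2: quoted name, group 3: name inside %q<...>
quoted_gem_name = /(["'])(#{gem_name})\1|%q<(#{gem_name})>/

s = "%q<Some-name1> \"some-name2\" 'some-name3'"

# Each scan result is [quote, quoted_name, percent_name]; only one name is set.
names = s.scan(quoted_gem_name).map { |_quote, quoted, percent| quoted || percent }
p names
```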
You can rewrite your pattern like this:
GEM_NAME = r'[a-zA-Z0-9_.-]+'
QUOTED_GEM_NAME = r'''(?x)   # verbose mode for the whole pattern (an inline flag must come first)
    ["'%]                    # first possible character
    (?:(?<=%)q<)?            # if preceded by a % match "q<"
    (?P<name>                # the three possibilities excluding the delimiters
        (?<=") {0} (?=") |
        (?<=') {0} (?=') |
        (?<=<) {0} (?=>)
    )
    ["'>]                    #'"# closing delimiter
'''.format(GEM_NAME)
Advantages:
The pattern doesn't start with an alternation, which would make the search slow. Here the alternation is only tested at interesting positions, after a quote or a %, whereas your version tests each branch of the alternation at every position in the string. This optimisation technique is called "first character discrimination" and consists of quickly discarding useless positions in a string.
You need only one capture group occurrence (quotes and angle brackets are excluded from it and only tested with lookarounds). This way you can use re.findall to get a list of gems without further manipulation.
The gq group wasn't useful and was removed (shortening a pattern at the cost of creating a useless capture group isn't a good idea).
Note that you don't need to escape the dot inside a character class.
A simple way is to use a conditional and consolidate the name.
(?:(?:(["'])|%q<)(?P<name>[a-zA-Z0-9\-_\.]+)(?(1)\1|>))
Expanded
(?:
    (?:                             # Delimiters
        ( ["'] )                    # (1), ' or "
      |                             # or,
        %q<                         # %q<
    )
    (?P<name> [a-zA-Z0-9\-_\.]+ )   # (2), Name
    (?(1) \1 | > )                  # Did group 1 match? Match it here, else >
)
Python
import re
s = ' "asdf" %q<asdfasdf> '
print ( re.findall( r'(?:(?:(["\'])|%q<)(?P<name>[a-zA-Z0-9\-_\.]+)(?(1)\1|>))', s ) )
Output
[('"', 'asdf'), ('', 'asdfasdf')]

Ruby regex - using optional named backreferences

I am trying to write a Ruby regex that will return a set of named matches. If the first element (delimited by slashes) is found anywhere later in the string, then I want the match to return everything from that second occurrence onward. Otherwise, return the whole string. The closest I've gotten is (?<p1>top_\w+).*?(?<hier>\k<p1>.*), which doesn't work for the 3rd item. I've tried regex if-then-else constructs, but Rubular says they're invalid. I've also tried (?<p1>[\w\/]+?)(?<hier>\k<p1>.*), which correctly splits the 1st and 4th lines but doesn't work for the others. Please note: I want all results returned under the same named reference so I can iterate through "hier".
Input:
top_cat/mouse/dog/top_cat/mouse/dog/elephant/horse
top_ab12/hat[1]/top_ab12/hat[1]/path0_top_ab12/top_ab12path1/cool
top_bat/car[0]
top_2/top_1/top_3/top_4/top_2/top_1/top_3/top_4/dog
Output:
hier = top_cat/mouse/dog/elephant/horse
hier = top_ab12/hat[1]/path0_top_ab12/top_ab12path1/cool
hier = top_bat/car[0]
hier = top_2/top_1/top_3/top_4/dog
Problem
The reason it does not match the second line is because the second instance of hat does not end with a slash, but the first instance does.
Solution
Specify that there is a slash between the first and second match
Regex
(top_.*)/(\1.*$)|(^.*$)
Replacement
hier = \2\3
More info on the Alternation token
To explain how the | token works in regex, see the example: abc|def
What this regex means in plain english is:
Match either the regex below (attempting the next alternative only if this one fails)
Match the characters abc literally
Or match the regex below (the entire match attempt fails if this one fails to match)
Match the characters def literally
Example
Regex: alpha|alphabet
If we had a phrase "I know the alphabet", only the word alpha would be matched.
However, if we changed the regex to alphabet|alpha, we would match alphabet.
So you can see, alternation works in a left-to-right fashion.
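The ordering effect is easy to check in Ruby. A small sketch using the phrase from the example above; the variable names are illustrative. The last line shows that word boundaries force the engine to backtrack into the longer alternative even when the shorter one is listed first.

```ruby
phrase = 'I know the alphabet'

# The leftmost alternative that succeeds wins, even if a longer one would fit.
short_first = phrase[/alpha|alphabet/]
long_first  = phrase[/alphabet|alpha/]

# With \b on both sides, "alpha" alone fails at the trailing boundary,
# so the engine falls back to the second alternative.
bounded = phrase[/\b(?:alpha|alphabet)\b/]

puts short_first
puts long_first
puts bounded
```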
paths = %w(
  top_cat/mouse/dog/top_cat/mouse/dog/elephant/horse
  top_ab12/hat/top_ab12/hat[1]/path0_top_ab12/top_ab12path1/cool
  top_bat/car[0]
  top_2/top_1/top_3/top_4/top_2/top_1/top_3/top_4/dog
  test/test
)
paths.each do |path|
  # Group 1 is the first element; group 2 starts at its last later occurrence
  md = path.match(/^([^\/]*).*\/(\1(\/.*|$))/)
  hier = md ? md[2] : path
  puts hier
end
Output:
top_cat/mouse/dog/elephant/horse
top_ab12/hat[1]/path0_top_ab12/top_ab12path1/cool
top_bat/car[0]
top_2/top_1/top_3/top_4/dog
test

Regex: Substring the second last value between two slashes of a url string

I have a string like this:
http://www.example.com/value/1234/different-value
How can I extract the 1234?
Note: There may be a slash at the end:
http://www.example.com/value/1234/different-value
http://www.example.com/value/1234/different-value/
/([^/]+)(?=/[^/]+/?$)
should work. You might need to format it differently according to the language you're using. For example, in Ruby, it's
if subject =~ /\/([^\/]+)(?=\/[^\/]+\/?\Z)/
  match = $~[1]
else
  match = ""
end
Use Slice for Positional Extraction
If you always want to extract the element at index 4 of the split URI (the scheme and the empty string between the two slashes occupy indices 0 and 1), and are confident that your data is regular, you can use Array#slice as follows.
'http://www.example.com/value/1234/different-value'.split('/').slice 4
#=> "1234"
'http://www.example.com/value/1234/different-value/'.split('/').slice 4
#=> "1234"
This will work reliably whether there's a trailing slash or not, whether or not you have more than 4 elements after the split, and whether or not that fourth element is always strictly numeric. It works because it's based on the element's position within the path, rather than on the contents of the element. However, you will end up with nil if you attempt to parse a URI with fewer elements such as http://www.example.com/1234/.
Use Scan/Match for Pattern Extraction
Alternatively, if you know that the element you're looking for is always the only one composed entirely of digits, you can use String#match with look-arounds to extract just the numeric portion of the string.
'http://www.example.com/value/1234/different-value'.match %r{(?<=/)\d+(?=/)}
#=> #<MatchData "1234">
$&
#=> "1234"
The look-behind and look-ahead assertions are needed to anchor the expression to a path. Without them, you'll match things like w3.example.com too. This solution is a better approach if the position of the target element may change, and if you can guarantee that your element of interest will be the only one that matches the anchored regex.
If there will be more than one match (e.g. http://www.example.com/1234/5678/) then you might want to use String#scan instead to select the first or last match. This is one of those "know your data" things; if you have irregular data, then regular expressions aren't always the best choice.
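Since the question asks for the second-to-last value, another option is to count from the end of the path instead of from the start. This is a rough sketch using Ruby's standard uri library; second_last_segment is a hypothetical helper name, not something from the answers above.

```ruby
require 'uri'

# Hypothetical helper: returns the second-to-last non-empty path segment,
# or nil when the path has fewer than two segments.
def second_last_segment(url)
  segments = URI.parse(url).path.split('/').reject(&:empty?)
  segments[-2]
end

p second_last_segment('http://www.example.com/value/1234/different-value')
p second_last_segment('http://www.example.com/value/1234/different-value/')
p second_last_segment('http://www.example.com/1234/')
```

Counting from the end makes the trailing slash irrelevant and ignores the host, at the cost of assuming the target is always followed by exactly one more segment.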
Javascript:
var myregexp = /:\/\/.*?\/.*?\/(\d+)/;
var match = myregexp.exec(subject);
if (match != null) {
result = match[1];
}
Works with your examples... But I am sure it will fail in general...
Ruby edit:
if subject =~ /:\/\/.*?\/.*?\/(.+?)\//
  match = $~[1]
end
It does work.
I think this is a little simpler than the accepted answer, because it doesn't use any positive lookahead (?=), but rather simply makes the last slash optional via the ? character:
^.+\/(.+)\/.+\/?$
In Ruby:
STDIN.read.split("\n").each do |nextline|
  if nextline =~ /^.+\/(.+)\/.+\/?$/
    printf("matched %s in %s\n", $~[1], nextline)
  else
    puts "no match"
  end
end
Let's break down what's happening:
^: start of the line
.+\/: match anything (greedily) up to a slash
Since we're going to later match at least 1, at most 2 more slashes, this slash will be either the second last slash (as in http://www.example.com/value/1234/different-value) or the third last slash (as in http://www.example.com/value/1234/different-value/)
Up to this point we've matched http://www.example.com/value/ (due to greediness)
(.+)\/: Our capturing group for 1234 indicated by the parenthesis. It's anything followed by another slash.
Since the previous match matched up to the second or third last slash, this will match up to the last slash or second last slash, respectively
.+: match anything. This would be after our 1234, so we're assuming there are characters after 1234/ (different-value)
\/?: optionally match another slash (the slash after different-value)
$: match the end of the line
Note that in a url, you probably won't have spaces. I used the . character because it's easily distinguished, but perhaps you might use \S instead to match non-spaces.
Also, you might use \A instead of ^ to match start of string (instead of after line break) and \Z instead of $ to match end of string (instead of at line break)
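The difference between line anchors and string anchors shows up as soon as the input spans more than one line. A small sketch with a made-up two-line input; it uses \z, which is the strict end of string, whereas \Z also allows one trailing newline.

```ruby
# Made-up input: a junk line followed by the URL on its own line.
input = "junk\nhttp://www.example.com/value/1234/different-value"

# ^ and $ match at every line start and end, so the second line is found.
line_anchored = input[/^.+\/(.+)\/.+\/?$/, 1]

# \A and \z only match at the very start and end of the string, and . does
# not cross the newline, so the whole-string pattern does not match.
string_anchored = input[/\A.+\/(.+)\/.+\/?\z/, 1]

p line_anchored
p string_anchored
```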

Ruby regular expression

Apparently I still don't understand exactly how it works ...
Here is my problem: I'm trying to match numbers in strings such as:
910 -6.258000 6.290
That string should give me an array like this:
[910, -6.258000, 6.290]
while the string
blabla9999 some more text 1.1
should not be matched.
The regex I'm trying to use is
/([-]?\d+[.]?\d+)/
but it doesn't do exactly that. Could someone help me ?
It would be great if the answer could clarify the use of the parentheses in the matching.
Here's a pattern that works:
/^[^\d]+?\d+[^\d]+?\d+[\.]?\d+$/
Note that [^\d]+ means at least one non digit character.
On second thought, here's a more generic solution that doesn't need to deal with regular expressions:
str.gsub(/[^\d.-]+/, " ").split.collect{|d| d.to_f}
Example:
str = "blabla9999 some more text -1.1"
Parsed:
[9999.0, -1.1]
The brackets have different meanings.
[] defines a character class: exactly one character that belongs to the class is matched.
() defines a capturing group: the substring matched by the part inside the parentheses is saved so you can retrieve it afterwards.
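A short illustration of the two kinds of brackets, as a sketch with a made-up input string: the character class picks one sign character, while the parentheses make the sign and the digits available separately on the MatchData object.

```ruby
# [-+] is a character class (one character: minus or plus);
# each pair of parentheses is a capturing group.
m = /([-+])([0-9]+)/.match('x-42y')

whole  = m[0] # the entire match
sign   = m[1] # first group
digits = m[2] # second group

puts whole
puts sign
puts digits
```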
You did not define any anchors so your pattern will match your second string
blabla9999 some more text 1.1
      ^^^^ here           ^^^ and here
Maybe this is more what you wanted
^(\s*-?\d+(?:\.\d+)?\s*)+$
^ anchors the pattern to the start of the line and $ to the end (in Ruby, use \A and \z if you need to anchor to the whole string).
It allows whitespace (\s) before and after each number, and an optional fractional part, (?:\.\d+)?. The + after the group means this number pattern must occur at least once.
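Putting the anchored check and the extraction together might look like the following. This is only a sketch: parse_numbers is a hypothetical helper name, and it assumes numbers are separated by whitespace and written without exponents.

```ruby
# Hypothetical helper: returns the numbers on a line, or nil if the line
# contains anything other than whitespace-separated numbers.
def parse_numbers(line)
  number = /-?\d+(?:\.\d+)?/
  # Validate the whole string first, then pull out the individual numbers.
  return nil unless line =~ /\A\s*#{number}(?:\s+#{number})*\s*\z/

  line.scan(number).map { |n| n.include?('.') ? n.to_f : n.to_i }
end

p parse_numbers('910 -6.258000 6.290')
p parse_numbers('blabla9999 some more text 1.1')
```

Validating first and scanning second keeps the capture logic simple: a repeated capturing group would only retain its last repetition.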
Maybe /(-?\d+(\.\d+)?)+/
irb(main):010:0> "910 -6.258000 6.290".scan(/(\-?\d+(\.\d+)?)+/).map{|x| x[0]}
=> ["910", "-6.258000", "6.290"]
str = " 910 -6.258000 6.290"
str.scan(/-?\d+\.?\d+/).map(&:to_f)
# => [910.0, -6.258, 6.29]
If you don't want integers to be converted to floats, try this:
str = " 910 -6.258000 6.290"
str.scan(/-?\d+\.?\d+/).map do |ns|
  ns[/\./] ? ns.to_f : ns.to_i
end
# => [910, -6.258, 6.29]

What's wrong with this RegEx?

I'm trying to implement this in a small Ruby script, and tested it on http://www.rubular.com/, where it worked perfectly. Not sure why it's not performing in the actual script.
The RegEx: /(motion|links|sound|button|symbol)|(0.\d{8})|(\s\d{1}\s)|(\d{10}\s)/
The Text it's Against:
Trial ID: 1 | Trial Type: motion | Trick? 1
Click Time: 0.87913100 1302969732
Trial ID: 7 | Trial Type: button | Trick? 0
Click Time: 0.19817800 1302987043
etc. etc.
What I am trying to grab: Only the numbers, and the single word after "Trial Type". So for the first line of the example, I would only want " 1 motion 1 0.87913100 1302969732" to be returned. I also want to keep the space before the first number in each trial.
My short ruby script:
File.open('log.txt', 'r') do |file|
  contents = file.readlines.to_s
  regex = Regexp.new(/(motion|links|sound|button|symbol)|(0\.\d{8})|(\s\d{1}\s)|(\d{10}\s)/)
  matchdata = regex.match(contents).to_a
  matchdata.each do |match|
    if match != nil
      puts match
    end
  end
end
It only outputs two "1"s though. Hmm... I know it's reading the file contents right, and when I tried an alternate, simpler regex it worked fine.
Thanks for any help I get here!! : )
You want to use String#scan
matchdata = contents.scan(regex)
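To see why the original script printed only two "1"s, compare the two calls on the first trial. This is a sketch using the regex from the question and a two-line sample of the log: match stops at the first hit and returns the full match plus its four groups, while scan keeps going through the whole string.

```ruby
regex = /(motion|links|sound|button|symbol)|(0\.\d{8})|(\s\d{1}\s)|(\d{10}\s)/
contents = "Trial ID: 1 | Trial Type: motion | Trick? 1\nClick Time: 0.87913100 1302969732\n"

# match: only the first hit; to_a is [full match, group 1, ..., group 4],
# so the same text shows up twice once the nils are dropped.
first_only = regex.match(contents).to_a.compact

# scan: one array of groups per hit; exactly one group is set each time.
all_matches = contents.scan(regex).map { |groups| groups.compact.first }

p first_only
p all_matches
```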
Also, @Mike Penington is correct: you shouldn't have to do the if match != nil check if you do it right. You have to clean up your regex as well. The pipe character is special in a regex (it means match the left side OR the right side), and your text contains literal pipe characters that you must escape.
You need to escape the literal pipes inside the regex, fill in the other missing literals (like Trick, \?, Click\sTime:), remove some of the spaces, and insert \s where whitespace is expected, i.e.
regex = Regexp.new(/(motion|links|sound|button|symbol)\s\|\sTrick\?\s*\d\s*Click\s+Time:\s+(0\.\d{,8})\s(\d{10})/)
EDIT: fixed parenthesis nesting in the original
If you know that the data follows a particular pattern, you can just follow that pattern in the regex, and pick up the portions you want with ( ).
/Trial ID: (\d+) \| Trial Type: (\w+) \| Trick\? (\d+) Click Time: ([\.\d]+) ([\.\d]+)/
The more you know about the data in advance, the more specific you can make the regex.
If you see some variations in the data, and the regex fails to match, then just relax the pattern:
If the Trial ID may include a decimal point, use [\.\d]+ instead of \d+.
If there can be more than one space, replace the space with [ ]+
If the space can be a tab, or can be absent, use \s* or [ \t]*.
If the Trial ID: part may appear as a different phrase, replace it with .*?,
and so on.
If you are not sure how many spaces/tabs appear, use this:
/Trial\s*ID:\s*(\d+)\s*\|\s*Trial\s*Type:\s*(\w+)\s*\|\s*Trick\?\s*(\d+)\s*Click\s*Time:\s*([\.\d]+)\s+([\.\d]+)/
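Combined with String#scan, the whitespace-tolerant pattern pulls every trial out of the log in one pass. This is a sketch that assumes the two-line layout shown in the question; the sample log is inlined as a heredoc instead of being read from log.txt, and the variable names are illustrative.

```ruby
log = <<~TEXT
  Trial ID: 1 | Trial Type: motion | Trick? 1
  Click Time: 0.87913100 1302969732
  Trial ID: 7 | Trial Type: button | Trick? 0
  Click Time: 0.19817800 1302987043
TEXT

trial = /Trial\s*ID:\s*(\d+)\s*\|\s*Trial\s*Type:\s*(\w+)\s*\|\s*Trick\?\s*(\d+)\s*Click\s*Time:\s*([\.\d]+)\s+([\.\d]+)/

# One row per trial: [id, type, trick, click time, timestamp], all strings.
rows = log.scan(trial)
rows.each do |id, type, trick, seconds, stamp|
  puts " #{id} #{type} #{trick} #{seconds} #{stamp}"
end
```

Because \s also matches a newline, the same pattern works whether a trial sits on one line or is wrapped across two.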
This is one of those times when trying to do everything in one big regex makes you work too hard. Simplify things:
ary = [
  'Trial ID: 1 | Trial Type: motion | Trick? 1 Click Time: 0.87913100 1302969732',
  'Trial ID: 7 | Trial Type: button | Trick? 0 Click Time: 0.19817800 1302987043'
]
ary.each do |li|
  numbers = li.scan(/[\d.]+/)
  trial_type = li[/Trial Type: (\w+)/, 1]
  puts "%d %s %d %f %d\n" % [numbers.first, trial_type, *numbers[1 .. -1]]
end
# >> 1 motion 1 0.879131 1302969732
# >> 7 button 0 0.198178 1302987043
Regex patterns are powerful, but people think it's macho to do everything in one big line. You have to weigh doing that with the increased work necessary to put together the regex in the first place, plus maintain it if something changes in the text being parsed later.
