Non-greedy subgroup Ruby regular expression matching - ruby

I'm trying to write a regex to parse the vendor, version, and format components of a media-type string, where the version will be after the final dash. For example:
matching on "vnd.mycompany-foo-bar-v1+json" should produce ['mycompany-foo-bar', 'v1', 'json']
matching on "vnd.mycompany-v1+json" should produce ['mycompany', 'v1', 'json']
matching on "vnd.mycompany+json" should produce ['mycompany', nil, 'json']
matching on "vnd.mycompany-foo-bar-v1" should produce ['mycompany-foo-bar', 'v1', nil]
So far the closest I've got is
/\Avnd\.([a-z0-9*.\-_!#\$&\^]+?)(?:-([a-z0-9*\-.]+))?(?:\+([a-z0-9*\-.+]+))?\z/
but matching against "vnd.mycompany-foo-bar-v1+json" gives me ['mycompany', 'foo-bar-v1', 'json'].
It's the possibly infinite number of dashes that's throwing me for a loop.

Regex:
\Avnd\.(.+?)(?:-([^-+]+))?(?:\+(.*))?\z
Break-down:
\Avnd\. Matches vnd. literally from the start of the string
(.+?) Matches any character, as few times as possible [group 1]
(?:-([^-+]+))? Optional. Matches a - followed by one or more characters other than - and + [group 2]
(?:\+(.*))? Optional. Matches a + followed by any characters [group 3]
\z Matches the end of the string
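As a quick check, the pattern above can be run against the four examples from the question. This is only a sketch: the constant MEDIA_TYPE and the helper parse_media_type are names invented here for illustration.

```ruby
# The pattern from the break-down above.
MEDIA_TYPE = /\Avnd\.(.+?)(?:-([^-+]+))?(?:\+(.*))?\z/

# Returns [vendor, version, format], or nil when the string does not match.
# (Hypothetical helper, not part of the original answer.)
def parse_media_type(str)
  md = MEDIA_TYPE.match(str)
  md && md.captures
end

p parse_media_type("vnd.mycompany-foo-bar-v1+json") # => ["mycompany-foo-bar", "v1", "json"]
p parse_media_type("vnd.mycompany-v1+json")         # => ["mycompany", "v1", "json"]
p parse_media_type("vnd.mycompany+json")            # => ["mycompany", nil, "json"]
p parse_media_type("vnd.mycompany-foo-bar-v1")      # => ["mycompany-foo-bar", "v1", nil]
```

One limitation to keep in mind: the pattern cannot tell a dashed vendor name from a version, so "vnd.my-company+json" yields ["my", "company", "json"].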

If the version is after the final dash, then version (and format) can't contain dashes. Just take them out of the character class.
/\Avnd\.([a-z0-9*.\-_!#\$&\^]+?)(?:-([a-z0-9*.]+))?(?:\+([a-z0-9*.+]+))?\z/
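A small check of this stricter pattern, as a sketch: the constant STRICT_MEDIA_TYPE, the helper strict_parse, and the uppercase sample string are made up for illustration. It produces the same captures for the question's examples, and it rejects input containing characters outside the character classes.

```ruby
# The pattern from the answer, copied verbatim.
STRICT_MEDIA_TYPE = /\Avnd\.([a-z0-9*.\-_!#\$&\^]+?)(?:-([a-z0-9*.]+))?(?:\+([a-z0-9*.+]+))?\z/

# Returns [vendor, version, format], or nil when the string does not match.
def strict_parse(str)
  md = STRICT_MEDIA_TYPE.match(str)
  md && md.captures
end

p strict_parse("vnd.mycompany-foo-bar-v1+json") # => ["mycompany-foo-bar", "v1", "json"]
p strict_parse("vnd.mycompany-foo-bar-v1")      # => ["mycompany-foo-bar", "v1", nil]
p strict_parse("vnd.MyCompany+json")            # => nil (uppercase is not in the class)
```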

Related

Splitting the content of brackets without separating the brackets ruby

I am currently working on a ruby program to calculate terms. It works perfectly fine except for one thing: brackets. I need to filter the content or at least, to put the content into an array, but I have tried for an hour to come up with a solution. Here is my code:
splitted = term.split(/\(+|\)+/)
I need an array instead of the brackets, for example:
"1-(2+3)" #=>["1", "-", ["2", "+", "3"]]
I already tried this:
/(\((?<=.*)\))/
but it returned:
Invalid pattern in look-behind.
Can someone help me with this?
UPDATE
I forgot to mention that my program will split the term; I only need the content of the brackets to be an array.
If you need to keep track of the hierarchy of parentheses with arrays, you won't manage it just with regular expressions. You'll need to parse the string word by word, and keep a stack of expressions.
Pseudocode:
Expressions = new stack
Add new array on stack
while word in string:
    If word is "(": Add new array on stack
    Else if word is ")": Remove the last array from the stack and add it to the (next) last array of the stack
    Else: Add the word to the last array of the stack
When exiting the loop, there should be only one array in the stack (if not, you have inconsistent opening/closing parentheses).
Note: If your ultimate goal is to evaluate the expression, you could save time and parse the string in Postfix aka Reverse-Polish Notation.
Also consider using off-the-shelf libraries.
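The stack-based pseudocode above could be sketched in Ruby roughly as follows. The method name nest_parentheses and the tokenizer regex are assumptions made for this example: runs of digits become one token, and every other non-space character (operators and parentheses) becomes its own token.

```ruby
# Turns "1-(2+3)" into ["1", "-", ["2", "+", "3"]] using a stack of arrays.
def nest_parentheses(term)
  stack = [[]]                     # the bottom array collects the top-level tokens
  term.scan(/\d+|\S/) do |token|   # assumed tokenizer: digit runs or single characters
    case token
    when "("
      stack.push([])               # open a new nested array
    when ")"
      raise ArgumentError, "unmatched )" if stack.size < 2
      finished = stack.pop         # close the innermost array...
      stack.last << finished       # ...and append it to its parent
    else
      stack.last << token
    end
  end
  raise ArgumentError, "unmatched (" unless stack.size == 1
  stack.first
end

p nest_parentheses("1-(2+3)")      # => ["1", "-", ["2", "+", "3"]]
p nest_parentheses("2*(1+(4-3))")  # => ["2", "*", ["1", "+", ["4", "-", "3"]]]
```

Because each ")" pops exactly one array, nesting of any depth is handled, and unbalanced input raises instead of silently producing a wrong structure.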
A solution depends on the pattern you expect between the parentheses, which you have not specified. (For example, for "(st12uv)" you might want ["st", "12", "uv"], ["st12", "uv"], ["st1", "2uv"] and so on). If, as in your example, it is a natural number followed by a +, followed by another natural number, you could do this:
str = "1-( 2+ 3)"
r = /
  \(\s*  # match a left parenthesis followed by >= 0 whitespace chars
  (\d+)  # match one or more digits in a capture group
  \s*    # match >= 0 whitespace chars
  (\+)   # match a plus sign in a capture group
  \s*    # match >= 0 whitespace chars
  (\d+)  # match one or more digits in a capture group
  \s*    # match >= 0 whitespace chars
  \)     # match a right parenthesis
/x
str.scan(r).first
# => ["2", "+", "3"]
Suppose instead + could be +, -, * or /. Then you could change:
(\+)
to:
([-+*\/])
Note that, in a character class, + needn't be escaped and - needn't be escaped if it is the first or last character of the class (as in those cases it would not signify a range).
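For illustration, here is the same pattern with the operator class swapped in. The constant name OPERATION and the sample strings other than "1-( 2+ 3)" are invented for this sketch.

```ruby
OPERATION = /
  \(\s*      # left parenthesis, optional whitespace
  (\d+)      # first operand
  \s*
  ([-+*\/])  # any one of the four operators
  \s*
  (\d+)      # second operand
  \s*\)      # optional whitespace, right parenthesis
/x

p "1-( 2+ 3)".scan(OPERATION).first   # => ["2", "+", "3"]
p "4*(10 / 5)".scan(OPERATION).first  # => ["10", "/", "5"]
p "7+(8-2)".scan(OPERATION).first     # => ["8", "-", "2"]
```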
Incidentally, you received the error message "Invalid pattern in look-behind" because Ruby's lookbehinds cannot contain variable-length matches (i.e., .*). With positive lookbehinds you can often get around that by using \K instead. For example,
r = /
\d+ # match one or more digits
\K # forget everything previously matched
[a-z]+ # match one or more lowercase letters
/x
"123abc"[r] #=> "abc"

Ruby regex - using optional named backreferences

I am trying to write a Ruby regex that will return a set of named matches. If the first element (defined by slashes) is found anywhere later in the string then I want the match to return that 2nd match onward. Otherwise, return the whole string. The closest I've gotten is (?<p1>top_\w+).*?(?<hier>\k<p1>.*) which doesn't work for the 3rd item. I've tried regex if-then-else constructs but Rubular says it's invalid. I've tried (?<p1>[\w\/]+?)(?<hier>\k<p1>.*) which correctly splits the 1st and 4th lines but doesn't work for the others. Please note: I want all results to return as the same named reference so I can iterate through "hier".
Input:
top_cat/mouse/dog/top_cat/mouse/dog/elephant/horse
top_ab12/hat[1]/top_ab12/hat[1]/path0_top_ab12/top_ab12path1/cool
top_bat/car[0]
top_2/top_1/top_3/top_4/top_2/top_1/top_3/top_4/dog
Output:
hier = top_cat/mouse/dog/elephant/horse
hier = top_ab12/hat[1]/path0_top_ab12/top_ab12path1/cool
hier = top_bat/car[0]
hier = top_2/top_1/top_3/top_4/dog
Problem
The reason it does not match the second line is because the second instance of hat does not end with a slash, but the first instance does.
Solution
Specify that there is a slash between the first and second match
Regex
(top_.*)/(\1.*$)|(^.*$)
Replacement
hier = \2\3
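In Ruby, the regex and replacement above could be applied with String#sub, as in this sketch. The constant DEDUP and the helper hier_line are names made up here; %r{...} is used only so that the literal slash needs no escaping.

```ruby
DEDUP = %r{(top_.*)/(\1.*$)|(^.*$)}

# An unmatched group is substituted as an empty string, so only one of
# \2 and \3 contributes text to the result.
def hier_line(path)
  path.sub(DEDUP, 'hier = \2\3')
end

puts hier_line("top_cat/mouse/dog/top_cat/mouse/dog/elephant/horse")
# hier = top_cat/mouse/dog/elephant/horse
puts hier_line("top_ab12/hat[1]/top_ab12/hat[1]/path0_top_ab12/top_ab12path1/cool")
# hier = top_ab12/hat[1]/path0_top_ab12/top_ab12path1/cool
puts hier_line("top_bat/car[0]")
# hier = top_bat/car[0]
puts hier_line("top_2/top_1/top_3/top_4/top_2/top_1/top_3/top_4/dog")
# hier = top_2/top_1/top_3/top_4/dog
```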
More info on the Alternation token
To explain how the | token works in regex, consider the example: abc|def
What this regex means in plain English is:
Match either the regex below (attempting the next alternative only if this one fails)
Match the characters abc literally
Or match the regex below (the entire match attempt fails if this one fails to match)
Match the characters def literally
Example
Regex: alpha|alphabet
If we had a phrase "I know the alphabet", only the word alpha would be matched.
However, if we changed the regex to alphabet|alpha, we would match alphabet.
So you can see, alternation works in a left-to-right fashion.
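The left-to-right behaviour is easy to confirm in Ruby. In this sketch, PHRASE and first_match are illustrative names, and the word-boundary variant is an addition that the answer itself does not mention.

```ruby
PHRASE = "I know the alphabet"

# String#[] with a Regexp returns the text of the first match (or nil).
def first_match(pattern)
  PHRASE[pattern]
end

p first_match(/alpha|alphabet/)          # => "alpha"
p first_match(/alphabet|alpha/)          # => "alphabet"

# With word boundaries the shorter alternative fails after "alpha",
# so the engine falls through to the longer one.
p first_match(/\b(?:alpha|alphabet)\b/)  # => "alphabet"
```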
paths = %w(
  top_cat/mouse/dog/top_cat/mouse/dog/elephant/horse
  top_ab12/hat/top_ab12/hat[1]/path0_top_ab12/top_ab12path1/cool
  top_bat/car[0]
  top_2/top_1/top_3/top_4/top_2/top_1/top_3/top_4/dog
  test/test
)
paths.each do |path|
  md = path.match(/^([^\/]*).*\/(\1(\/.*|$))/)
  hier = md ? md[2] : path
  puts hier
end
Output:
top_cat/mouse/dog/elephant/horse
top_ab12/hat[1]/path0_top_ab12/top_ab12path1/cool
top_bat/car[0]
top_2/top_1/top_3/top_4/dog
test

Alternation gives unexpected result

In Ruby, I'm trying to extract some patterns from a long string and put each match into an array of strings. For example, the long string input can be
"\"/ebooks/1234.pdf\" \"/magazines/4321.djvu\""
The expected result is
["/ebooks/1234.pdf", "/magazines/4321.djvu"]
That is a forward slash, followed by one of the three keywords: ebooks, magazines, or newspapers, followed by another forward slash, followed by an arbitrary number of non-whitespace characters except the double quote mark.
I tried this pattern using alternation (the vertical bar), but it failed:
/\/(ebooks|magazines)\/[^\s"]+/
Which gives this result:
[["ebooks"], ["magazines"]]
What should be the correct pattern?
When the pattern contains a capture group, scan returns the captured groups instead of the whole matches. Make the group non-capturing with (?:...), or capture exactly the part you want:
"\"/ebooks/1234.pdf\" \"/magazines/4321.djvu\""
.scan(/\/(?:ebooks|magazines|newspapers)\/[^\s"]+/)
# => ["/ebooks/1234.pdf", "/magazines/4321.djvu"]
"\"/ebooks/1234.pdf\" \"/magazines/4321.djvu\""
.scan(/"([^"]+)"/).flatten
# => ["/ebooks/1234.pdf", "/magazines/4321.djvu"]

Why won't my simple regex pattern match and remove a file extension?

I have a string:
app_copy--28.ipa
The result I want is:
app_copy
The number after -- could be of variable length, so I want to match everything including and after --.
I've tried a few patterns, but none are matching for some reason:
gsub("--\*", "")
gsub("--*", "")
gsub("--*.ipa", "")
gsub("--\[0-9].ipa", "")
What am I missing?
Let's take a look at your test patterns:
"--\*" is actually equivalent to "--*" (since the \* is an escape sequence).
"--*" will match a single - character, followed by zero or more - characters.
"--*.ipa" will match a single - character, followed by zero or more - characters, followed by any single character, followed by a literal ipa.
"--\[0-9].ipa" is actually equivalent to "--[0-9].ipa" (since the \[ is an escape sequence), which will match a literal --, followed by a single decimal digit, followed by any single character, followed by a literal ipa.
However, none of these patterns would work as you used them because gsub will not treat it as a regular expression:
The pattern is typically a Regexp; if given as a String, any regular expression metacharacters it contains will be interpreted literally…
You'd need to convert your pattern to a Regexp (using Regexp.new), or use a regular expression literal.
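The difference between a String pattern and a Regexp pattern can be seen in a short sketch. The constant FILE_NAME and the "a--*b" sample are made up for the demonstration.

```ruby
FILE_NAME = "app_copy--28.ipa"

# A String pattern is matched literally: "--*" only matches the text "--*".
p FILE_NAME.gsub("--*", "")               # => "app_copy--28.ipa" (unchanged)
p "a--*b".gsub("--*", "")                 # => "ab"

# A Regexp literal, or a String converted with Regexp.new, is treated as a pattern.
p FILE_NAME.gsub(/--.*/, "")              # => "app_copy"
p FILE_NAME.gsub(Regexp.new("--.*"), "")  # => "app_copy"
```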
Try this pattern
--.*
This pattern will find any literal --, followed by zero or more of any character.
For example:
"app_copy--28.ipa".gsub(/--.*/, "") # => "app_copy"
Don't use gsub to change the string; simply use a pattern to match the part you want:
"app_copy--28.ipa"[/^(.+?)--/, 1] # => "app_copy"
String's [] takes a lot of different types of parameters. You can pass in a pattern, and the index of the capture that you want, to extract just that part. From the documentation:
str[regexp, capture] → new_str or nil
If a Regexp is supplied, the matching portion of the string is returned. If a capture follows the regular expression, which may be a capture group index or name, follows the regular expression that component of the MatchData is returned instead.
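A few variations on that call, as a sketch. The constant IPA_NAME, the group name base, and the "no_separator.ipa" sample are hypothetical and used only for illustration.

```ruby
IPA_NAME = "app_copy--28.ipa"

# Capture selected by index.
p IPA_NAME[/^(.+?)--/, 1]                # => "app_copy"

# Capture selected by name.
p IPA_NAME[/^(?<base>.+?)--/, "base"]    # => "app_copy"

# Without a capture argument the whole match is returned.
p IPA_NAME[/^(.+?)--/]                   # => "app_copy--"

# When the pattern does not match, the result is nil rather than an error.
p "no_separator.ipa"[/^(.+?)--/, 1]      # => nil
```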
How about this?
str = "app_copy--28.ipa"
str[0..str.index("-")-1]
# => "app_copy"
str = "app_copy--28.ipa"
str.split("--").first
# => "app_copy"

Match consecutive list of exactly one character in set with regular expressions

I don't think I'll even try to explain this, I don't know the words to, but I'd like to achieve the following:
Given a string like this:
+++>><<<--
I'd like a match to give me: +++, but also match if any of the other characters were in the string consecutively like they are. So if the +++ wasn't there, I'd like to match >>.
I tried using the following regular expression:
([><\-\+]+)
However, given the string above, it would match the entire string, and not the first list of consecutive characters.
If it makes a difference, this is in Ruby (1.9.3).
Not sure about the ruby bit, but you can do this with backreferences in the pattern:
(.)\1+
This uses a capturing group () to capture any character . followed by one or more + repetitions of the same character \1. The \1 is a backreference to the first captured group; in a pattern with more capturing groups, \2 would be the second captured group, and so on.
Java Example
Pattern p = Pattern.compile("(.)\\1+");
Matcher m = p.matcher("aaabbccaa");
m.find();
System.out.println(m.group(0)); // prints "aaa"
Ruby Example
# Return an array of matched patterns.
string = '+++>><<<--'
string.scan( /((.)\2+)/ ).collect { |match| match.first }
# => ["+++", ">>", "<<<", "--"]
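If only the first run is wanted, as the question asks, String#[] with the same backreference idea may be enough. This is a sketch: RUN, SINGLE_OR_RUN and first_run are illustrative names, and the variant restricted to the four characters from the question is an assumption about what the asker needs.

```ruby
RUN = /(.)\1+/                    # a character repeated at least once more
SINGLE_OR_RUN = /([><+\-])\1*/    # one of > < + - followed by zero or more repeats

# Returns the first run of a repeated character, or nil if there is none.
def first_run(str)
  str[RUN]
end

p first_run("+++>><<<--")         # => "+++"
p first_run(">><<<--")            # => ">>"
p first_run("+>><<<")             # => ">>" (the lone "+" is skipped)
p "+>><<<"[SINGLE_OR_RUN]         # => "+"  (\1* also accepts a run of length one)
```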
