Named subroutines in Oniguruma regex engine? - ruby

In Perl, you can do this:
(?x)
(?(DEFINE)
(?<animal>dog|cat)
)
(?&animal)
In Ruby (Oniguruma engine), it seems that the (?(DEFINE... syntax is not supported. Also, (?&... becomes \g. So, you can do this:
(?x)
(?<animal>dog|cat)
\g<animal>
But of course, this is not equivalent to the Perl example I gave above, becuase the first (?<animal>dog|cat) is not ignored, since there isn't anything like (?(DEFINE....
If I want to define a large regex with a bunch of named subroutines, what I could once do in Perl can't be done this way.
It does seem that I could hack together a pretty awkward solution by doing something like this:
(?x)
(?:^$DEFINE
(?<animal>dog|cat)
){0}
\g<animal>
But, that is pretty hackish. Is there a better way to do this? Does Oniguruma support a way to define named subroutines without having to try to "match" them first?
Alternatively, if there is a way to get true PCRE to work in Ruby, with ?(DEFINE... and (?&... I'd take that too.
Thanks!

You don't need a so complicated hack. Writing:
(?x)
(?<animal>dog|cat){0}
(?<color>red|green|blue){0}
...
your main pattern here
does exactly the same.
Putting all group definitions inside (?:^$DEFINE ... ){0} is only cosmetic.
Note that a group with the quantifier {0} isn't tried at all (the quantifier is taken in account first), and if in this way the named group is defined anyway, man can deduce that it isn't really a hack, but the way to do it with oniguruma.

Related

What does :C colon-C mean in FreeBSD make?

My company uses FreeBSD, and therefore FreeBSD's flavor of make.
A few of our in-house ports include something like this (where BRANCH is something that came from an SVN URL, either 'trunk' or a branch name like 'branches/1.2.3').
PORTVERSION= ${BRANCH:C,^branches/,,}
The Variable modifiers section of make(1) documents the :C colon-c modifier as
:C/pattern/replacement/[1gW]
Am I looking at the right documentation? ^branches/ looks like a regex pattern to me, but it looks like the actual code uses , instead of / as a separator. Did I skip documentation explaining that?
The documentation says:
:C/pattern/replacement/[1gW]
The :C modifier is just like the :S modifier except that the old and new strings, instead of being simple strings, are an extended regular expression (see regex(3)) string pattern and an ed(1)-style string replacement.
and in :S:
Any character may be used as a delimiter for the parts of the modifier string.
As #MadScientist pointed out, it's quite common to use a different delimiter, especially when / is a part of pattern or replacement string, like in your case. Otherwise it would require escaping and would look like ${BRANCH:C/^branches\///} which seems less readable.

Content Inside Parenthesis Regular Expression Ruby

I'm trying to take out the the content inside the parenthesis. For example, if the string is "(blah blah) This is stack(over)flow", I want to just take out "(blah blah)" but leave "(over)" alone. I'm trying
/\A\(.*\)/
but returns "(blah blah) This is stack(over)", and I'm sure why it's returning that.
Easiest fix:
/\A\(.*?\)/
Normally, * will try to match as much as it possibly can, so it'll match all the way to the last ) in the line. This is called "greedy" matching. Putting ? after +/*/? makes them non-greedy, and they'll match the shortest possible string.
But note that this won't work for nested parentheses. That's rather more complicated. Given your example, I assume this is for a pretty simple ad-hoc format where nesting isn't a concern.

Regexp in ruby - can I use parenthesis without grouping?

I have a regexp of the form:
/(something complex and boring)?(something complex and interesting)/
I'm interested in the contents of the second parenthesis; the first ones are there only to ensure a correct match (since the boring part might or might not be present but if it is, I'll match it by accident with the regexp for the interesting part).
So I can access the second match using $2. However, for uniformity with other regexps I'm using I want that somehow $1 will contain the contents of the second parethesis. Is it possible?
Use a non-capturing group:
r = /(?:ab)?(cd)/
This is a non-ruby regexp feature. Use /(?:something complex and boring)?(something complex and interesting)/ (note the ?:) to achieve this.
By the way, in Ruby 1.9, you can do /(something complex and boring)?(?<interesting>something complex and interesting)/ and access the group with $~[:interesting] ;)
Yup, use the ?: syntax:
/(?:something complex and boring)?(something complex and interesting)/
I'm not a ruby developer however I know other regex flavors. So I bet you can use a non capturing group
/(?:something complex and boring)?(something complex and interesting)/
There is only one capturing group, hence $1
HTH
Not really, no. But you can use a named group for uniformity, like this:
/(?<group1>something complex and boring)?(?<group2>something complex and interesting)/
You can change the names (the text in the angle brackets) for the uniformity that you want to achieve. You can then access the groups like this:
string.match(/(?<group1>something complex and boring)?(?<group2>something complex and interesting)/) do |m|
# Do something with the match, m['group'] can be used to access the group
end

Ruby gsub : is there a better way

I need to remove all leading and trailing non-numeric characters. This is what I came up with. Is there a better implementation.
puts s.gsub(/^\D+/,'').gsub(/\D+$/,'')
Instead of eliminating what you don't want, it's often clearer to select what you do want (using parentheses). Also, this only requires one regex evaluation:
s.match(/^\D*(.*?)\D*$/)[1]
Or, this convenient shorthand:
s[/^\D*(.*?)\D*$/, 1]
Perhaps a single #gsub(/(^\D+)|(\D+$)/, '')
Also, when in doubt Rubular it.

Optimal Regular Expression: match sets of lines starting with

Alright, this one's interesting. I have a solution, but I don't like it.
The goal is to be able to find a set of lines that start with 3 periods - not an individual line, mind you, but a collection of all the lines in a row that match. For example, here's some matches (each match is separated by a blank line):
...
...hello
...
...hello
...world
...
...wazzup?
...
My solution is as follows:
^\.\.\..*(\n\.\.\..*)*$
It matches all those, so it's what I'm using for now - however, it looks kinda silly to repeat the \.\.\..* pattern. Is there a simpler way?
Please test your regex before submitting it, rather than submit what "should work." For example, I tried the following first:
(^\.\.\..*$)+
which only returned individual lines, even though in my mind it looks like it would do the trick - I guess I just don't understand regex internals. (And no, I didn't need to set any flags to get ^ and $ to match line boundaries, since I'm implementing this in Ruby.)
So I'm not totally sure there's a good answer, but one would be much appreciated - thanks in advance!
In most regex implementations you can shorten \.\.\. using \.{3} so your solution would turn into \.{3}.*(\n\.{3}.*)*.
What you already have is already simple and understandable. Keep in mind that more "clever" RegExps may very well be slower and undoubtedly less readable.
Assuming lines are terminated by a \n:
((^|\n)\.{3}[^\n]*)+
I am not familiar with Ruby, so depending on how it returns matches you might need to "nonmatch" groups:
((?:(?:^|\n)\.{3}[^\n]*)+)
^([.]{3}.*$\n?)+
This doesn't really need $ in there.
You are pretty close to a solution with (^\.\.\..*$)+, but because the + modifier is on the outside of the group, it is getting overwritten each time and you are only left with the last line. Try wrapping it in an outer group: ((^\.\.\..*$)+) and looking at the first submatch and ignoring the inner one.
Combined with the other suggestion: ((^\.{3}.*$)+)

Resources