Ruby regex for text within parentheses

I am looking for a regex to replace all terms in parentheses unless the parentheses are within square brackets.
e.g.
(matches) #match
[(do not match)] #should not match
[[does (not match)]] #should not match
I currently have:
[^\]]\([^()]*\) #Not a square bracket, an opening bracket, any non-bracket character and a closing bracket.
However this is still matching words within the square brackets.
I have also created a rubular page of my progress so far: http://rubular.com/r/gG22pFk2Ld

A regex is not going to cut it for you if you can nest the square brackets (see this related question).
I think you can only do this with a regex if (a) you only allow one level of square brackets and (b) you assume all square brackets are properly matched. In that case
\([^()]*\)(?![^\[]*])
is sufficient - it matches any parenthesised expression not followed by an unpaired ]. You need (b) because of the limitations of negative lookbehind (only fixed length strings in 1.9, and not allowed at all in 1.8), which mean you are stuck matching (match)] even if you don't want to.
So basically if you need to nest, or to allow unmatched brackets, you should ditch the regex and look at the answer to the question I linked to above.
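As a quick check of the single-level case, here is a minimal Ruby sketch that applies the pattern above to the question's examples. The names OUTSIDE_BRACKETS and replace_parens and the "<>" replacement are made up for illustration, and the closing ] is escaped, which should not change what the pattern matches. The last sample shows how nested square brackets break it.

```ruby
# Hypothetical names; the pattern is the one from the answer with "]" escaped.
OUTSIDE_BRACKETS = /\([^()]*\)(?![^\[]*\])/

def replace_parens(text)
  text.gsub(OUTSIDE_BRACKETS, "<>")
end

puts replace_parens("(matches)")          # parenthesised term is replaced
puts replace_parens("[(do not match)]")   # left alone
puts replace_parens("a (x) [b (y)] (z)")  # only (x) and (z) are replaced
puts replace_parens("[(a) [b]]")          # nested brackets: (a) is wrongly replaced
```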

This is a type of expression you cannot parse using a pure-regex approach, because you need to keep track of the current nesting/state_if_in_square_bracket (so you don't have a type 3 language anymore).
However, depending on the exact circumstances, you can parse it with multiple regexes or simple parsers. Example approaches:
Split into sub-strings delimited by [ / [[ or ] / ]], change the state whenever such a square bracket is encountered, and replace () in a sub-string only while in the "not_in_square_bracket" state
Parse for square brackets (including content), remove & remember them (these are "comments"), now replace all the content in normal brackets and re-add the square brackets stuff (you can remember stuff by using unique temp strings)
The complexity of your solution also depends on whether escaping ] is allowed.
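The state-tracking idea could look roughly like this in Ruby. This is only a sketch under assumptions: replace_outside_brackets is a made-up name, ] cannot be escaped, and a parenthesised term never contains square brackets itself. It walks over brackets and parenthesised terms in order and keeps a nesting depth.

```ruby
# Hypothetical helper: replaces "(...)" only while not inside square brackets.
# Assumes "]" cannot be escaped and parentheses do not contain square brackets.
def replace_outside_brackets(text, replacement = "")
  depth = 0 # number of "[" currently open
  text.gsub(/\[|\]|\([^()\[\]]*\)/) do |token|
    case token
    when "["
      depth += 1
      token
    when "]"
      depth -= 1 if depth > 0 # ignore unmatched closing brackets
      token
    else
      depth.zero? ? replacement : token
    end
  end
end

puts replace_outside_brackets("(matches) [[does (not match)]]", "<>")
```

Unlike the single-regex version, this copes with nested square brackets, at the cost of no longer being one pattern.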

Related

Ensure non-matching of a pattern within a scope

I am trying to create a regex that matches a pattern in some part of a string, but not in another part of the string.
I am trying to match a substring that
(i) is surrounded by a balanced pair of one or more consecutive backticks `
(ii) and does not include as many consecutive backticks as in the surrounding patterns
(iii) where the surrounding patterns (sequence of backticks) are not adjacent to other backticks.
This is some variant of the syntax of inline code notation in Markdown syntax.
Examples of matches are as follows:
"xxx`foo`yyy" # => matches "foo"
"xxx``foo`bar`baz``yyy" # => matches "foo`bar`baz"
"xxx```foo``bar``baz```yyy" # => matches "foo``bar``baz"
One regex to achieve this is:
/(?<!`)(?<backticks>`+)(?<inline>.+?)\k<backticks>(?!`)/
which uses a non-greedy match.
I was wondering if I can get rid of the non-greedy match.
The idea comes from when the prohibited pattern is a single character. When I want to match a substring that is surrounded by a single quote ' that does not include a single quote in it, I can do either:
/'.+?'/
/'[^']+'/
The first one uses non-greedy match, and the second one uses an explicit non-matching pattern [^'].
I am wondering if it is possible to have something like the second form when the prohibited pattern is not a single character.
Going back to the original issue, there is the negative lookahead syntax (?!), but I cannot restrict its effective scope. If I make my regex like this:
/(?<!`)(?<backticks>`+)(?<inline>(?!.*\k<backticks>).*)\k<backticks>(?!`)/
then the effect of (?!.*\k<backticks>) will not be limited to within (?<inline>...), but will extend to the rest of the string. And since that contradicts the \k<backticks> at the end, the regex fails to match.
Is there a regex technique to ensure non-matching of a pattern (not-necessarily a single character) within a certain scope?
You can search for one or more characters which aren't the first character of a delimiter:
/(?<!`)(?<backticks>`+)(?<inline>(?:(?!\k<backticks>).)+)\k<backticks>(?!`)/
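In case it helps, here is a small Ruby sketch that runs this regex against the examples from the question. INLINE_CODE and inline_code are names I made up for the illustration. The lookahead is checked before every single character, so it only forbids the delimiter inside the inline group.

```ruby
# Hypothetical constant name; the pattern is the one given above.
INLINE_CODE = /(?<!`)(?<backticks>`+)(?<inline>(?:(?!\k<backticks>).)+)\k<backticks>(?!`)/

# Returns the text between the backtick delimiters, or nil if nothing matches.
def inline_code(text)
  match = INLINE_CODE.match(text)
  match && match[:inline]
end

puts inline_code("xxx`foo`yyy")
puts inline_code("xxx``foo`bar`baz``yyy")
puts inline_code("xxx```foo``bar``baz```yyy")
```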

NP++: Regular expression

I have a text with many expressions like this <.....>, e.g.:
<..> Text1 <.sdfdsvd> Text 2 <....dgdfg> Text3 <...something> Text4
How can I eliminate now all brackets <...> and all commands/texts between these brackets? But the other "real" text between these (like text1, text2 above) should not be touched.
I tried with the regular expression:
<.*>
But this finds also a block like this, including the inbetween text:
<..> Text1 <.sdfdsvd>
My second try was to search for all expressions <.> without a third bracket between these two, so I tried:
<.*[^>^<]>
But that does not work either, no change in behavior. How to construct the needed expression correctly?
This works in Notepad++:
Find what: <[^>]+?>
Replace with: nothing
Try it out: http://regex101.com/r/lC9mD4
There are a few problems with your attempt: <.*[^>^<]>
.* matches as many characters as possible, up through the final possible match. This means that everything from the first tag through the last is swallowed as a single match. This is called greedy. In my solution, I have changed it to lazy (non-greedy), which stops at the first possible match: .*?...although I apply this to the character class itself: [^>]+?.
[^>^<] is incorrect for two reasons, one small, one big. The small reason is that the first caret ^ says "do not match any of the following characters", and the characters following it are >, ^, and <. So you are saying you don't want to match the caret character, which is incorrect (but not harmful). The larger problem is that this is attempting to match exactly one character, when it needs to be one or more, which is signified by the plus sign: [^><]+.
Otherwise, your attempt is not that far off from my solution.
This seems to work:
<[^\s]*>
It looks for a left bracket, then anything that isn't whitespace between the brackets, then a right bracket. It would need some adjusting if there's whitespace between the brackets (<text1 text2>), though, and at that point a modification of one of your attempts would work better:
<[^<^>]*>
This one looks for a left bracket, then anything that isn't a left bracket or right bracket (or a caret, since the second ^ inside the class is a literal caret), then a right bracket.
Try <.*?>. If you don't use the "?", regular expressions will try to find the longest string that matches. Using "*?" forces it to find the shortest.
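To see the difference between the greedy, lazy and negated-class versions outside of Notepad++, here is a rough Ruby sketch on the sample line from the question. Ruby is used purely for illustration, on the assumption that these basic constructs behave the same way in both engines. The method names are made up.

```ruby
SAMPLE = "<..> Text1 <.sdfdsvd> Text 2 <....dgdfg> Text3 <...something> Text4"

# Greedy: one match runs from the first "<" to the last ">".
def strip_greedy(text)
  text.gsub(/<.*>/, "")
end

# Lazy: each match stops at the first ">" it can.
def strip_lazy(text)
  text.gsub(/<.*?>/, "")
end

# Negated class: a match can never run past a ">".
def strip_negated(text)
  text.gsub(/<[^>]+>/, "")
end

p strip_greedy(SAMPLE)   # only the text after the last tag survives
p strip_lazy(SAMPLE)     # all four texts survive
p strip_negated(SAMPLE)  # same result as the lazy version here
```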

Minus sign that appears not to designate a range and escaping a closing parenthesis

I have 2 questions regarding the following regex from Why's Poignant Guide to Ruby:
1: What does the minus sign mean here? It doesn't seem to be designating a range because there is nothing to the left of it other than the bracket.
2: Why is it necessary to escape the closing parenthesis? After you escape the opening one, what special meaning could the closing parenthesis have?
/\([-\w]+\)/
1) When the minus sign is at the beginning or at the end of a character class, it is seen as literal.
2) escaping closing parenthesis is a convention. The goal is IMO, to avoid an ambiguity with a possible opening parenthesis before. Consider these examples:
/(\([-\w]+\))/ or /(\([-\w]+)\)/
1) The minus sign is a literal minus sign. Since it cannot possibly designate a range, it has no special meaning and so the character class is equivalent to [\-\w] - escaping the hyphen is optional, as you observe in your second point...
2) ...however, it isn't always good form to not escape something just because the regular expression engine allows it. For example, this regex: ([([^)-]+) is perfectly valid (I think...) but entirely unclear because of the fact that characters which normally have special meanings are used as literal characters without being escaped. Valid, yes, but not obvious, and someone who doesn't know all the rules will become very confused trying to understand it.
The minus sign -, or say the hyphen, means exactly the character -. The hyphen can be included right after the opening bracket, or right before the closing bracket, or right after the negating caret. It's not designating a range, so it's not confusing. You can also choose to use \- if you like.
As to why to escape ), I think it means to reduce the regex engine's work so that it doesn't have to remember if an opening parenthesis is before.
The - sign in this regex actually means a - sign that you want to see in the text.
Non-escaped parentheses denote a capture group, which will be available to you, for example, through the $1 variable.
> "(-w)" =~ /\([-\w]+\)/
> $1 # => nil
and
> "(-w)" =~ /([-\w]+)/
> $1 # => -w
You can go to Rubular and try both regexes \([-\w]+\) and ([-\w]+) - and you will see different results by passing (-w) as a test. You can notice match groups appearing.
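If you'd rather check this in a script than on Rubular, a small sketch along these lines should do. The method names are made up for the example. It contrasts a literal hyphen with a range, and escaped parentheses with a capture group.

```ruby
# Hyphen first in the class: a literal "-".
def leading_hyphen(text)
  text.scan(/[-a]/)
end

# Hyphen last in the class: also a literal "-".
def trailing_hyphen(text)
  text.scan(/[\w-]/)
end

# Hyphen between two characters: a range from "a" to "c".
def hyphen_range(text)
  text.scan(/[a-c]/)
end

p leading_hyphen("x-a")
p trailing_hyphen("x-a!")
p hyphen_range("abc-")
p "(-w)"[/\([-\w]+\)/]   # escaped parentheses match the literal characters
p "(-w)"[/([-\w]+)/, 1]  # unescaped parentheses capture what is inside them
```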

How to check how many variables (masks) declared in Regexp (ruby)?

Let's say I have a regexp with some arbitrary amount of capturing groups:
pattern = /(some)| ..a lot of masks combined.... |(other)/
Is there any way to determine a number of that groups?
If you can always find a string that matches the regex you are given, then it suffices to match it against the regex and look at the match data length (one more than the number of groups, since it includes the whole match). However, determining whether a regexp has a string that it matches is NP-hard[1]. This is only feasible if you know in advance what kind of regexes you'll be getting.
The next best method in the Regexp class is Regexp#source or Regexp#to_s. However, we need to parse the regex if we do this.
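For the case where a matching string is known, a minimal sketch could look like this. The helper name count_groups_by_match and the sample strings are assumptions made for the example. MatchData#size counts the whole match as well, hence the subtraction.

```ruby
# Hypothetical helper: counts capturing groups by matching a string that is
# known to match. Returns nil if the sample does not match after all.
def count_groups_by_match(pattern, sample)
  match = pattern.match(sample)
  match && match.size - 1 # size includes the whole match at index 0
end

p count_groups_by_match(/(some)|(thing)|(other)/, "other")
p count_groups_by_match(/(a)(?:b)(c)?/, "ab")
```

Groups that did not take part in the match still count, since they appear in the match data as nil.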
I can't speak for the future, but as of Ruby 2.0, there is no better method in the Regexp core class.
A left parenthesis denotes a literal left parenthesis, if preceded by an unescaped backslash. A backslash is unescaped unless an unescaped backslash precedes. So, a character is escaped iff preceded by an odd number of backslashes.
An unescaped left parenthesis denotes a capturing group iff not followed by a question mark. With a question mark, it can mean various things: (?'name') and (?<name>) denote a named capturing group. Named and unnamed capturing groups cannot coexist in the same regex, however[2]. (?:) denote non-capturing groups. This is a special case of (?flags-flags:). (?>) denote atomic groups. (?=), (?!), (?<=) and (?<!) denote lookaround. (?#) denote comments.
Ruby's regexp engine supports comments in regexes. Considering them in the main regex would be very difficult. We can try to strip them if we really want to support these, but supporting them fully will get messy due to the possibility of inline flags turning extended mode (and thus line comments) on and off in ways that a regular expression cannot capture. I will go ahead and not support unescaped parentheses in regex comments[3].
We want to count:
the number of left parentheses \(
that are not escaped by a backslash (?<!(?<!\\)(?:\\\\)*\\) (read: not preceded by an odd number of backslashes that are not preceded by yet another backslash) and
that are not followed by a question mark (?!\?)
Ruby doesn't support unbounded lookbehind, but if we reverse the source first, we can rewrite the first assertion as a lookahead: (?=(?:\\\\)*(?!\\)) (read: followed by an even number of backslashes and then no further backslash). The second assertion becomes a lookbehind: (?<!\?).
The whole solution:
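def count_groups(regexp)
  # named capture support:
  # named_count = regexp.named_captures.count
  # return named_count if named_count > 0
  # main: scan the reversed source, so that the backslash check
  # can be a lookahead and the question-mark check a lookbehind
  test = /(?<!\?)\((?=(?:\\\\)*(?!\\))/
  regexp.source.reverse.scan(test).count
end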
[1]: we can show the NP-hardness by converting the satisfiability problem to it:
AND: xy (x must be an assertion)
OR: x|y
NOT: (?!x)
atoms: (?=1), (?=.1), (?=..1), ..., (?!1), (?!.1)...
example(XOR): /^(?:(?=1)(?!.1)|(?!1)(?=.1))..$/
this extends to NP-completeness for any class of regexes that can be tested in polynomial time. This includes any regex with no nested repetition (or repeated backreferences to repetition or recursion) and with bounded nesting depth of optional matches.
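To make the encoding concrete, here is a small sketch that tries the XOR example on all two-bit inputs. Only the regex comes from the footnote; the names XOR_PATTERN and satisfying_inputs are made up. Each character position plays the role of one boolean variable, with "1" meaning true.

```ruby
# The XOR example from the footnote; the constant name is hypothetical.
XOR_PATTERN = /^(?:(?=1)(?!.1)|(?!1)(?=.1))..$/

# Returns the two-character inputs that the pattern accepts.
def satisfying_inputs
  %w[00 01 10 11].select { |bits| bits.match?(XOR_PATTERN) }
end

p satisfying_inputs
```

Finding any string that the regex matches is then the same as finding a satisfying assignment for the encoded formula.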
[2]: /((?<name>..)..)../.match('abcdef').to_a returns ['abcdef', 'ab'], indicating that unnamed capturing groups are ignored when named capturing groups are present. Tested in Ruby 1.9.3
[3]: Inline comments start with (?# and end with ). They cannot contain an unescaped right parenthesis, but they can contain an unescaped left parenthesis. These can be stripped easily (even though we have to sprinkle the "unescaped" regex everywhere) and are the lesser evil, but they're also less likely to contain an unescaped left parenthesis.
Line comments start with # and end with a newline. These are only treated as comments in the extended mode. Outside the extended mode, they match the literal # and newline. This is still easy, even if we have to consider escaping again. Determining if the regex has the extended flag set is not too difficult, but the flag modifier groups are a different beast entirely.
Even with Ruby's awesome recursive regexes, merely determining if a previously-open group modifying the extended mode is already closed would yield a very nasty regex (even if you replace one by one and don't have to skip comments, you have to account for escaping). It wouldn't be pretty (even with interpolation) and it wouldn't be fast.

Removing parenthesis and digit from string with regex

I have strings that look like this:
Executive Producer (3)
Producer (0)
1st Assistant Camera (12)
I'd like to use a regex to match the first part of the string and to remove the " (num)" part (the space preceding the parentheses and the parenthesis/digit in the parentheses). After using the regex I'd want to have my vars equal to: "Executive Producer", "Producer", "1st Assistant Camera"
If you know any resources for learning regexes that would be great too.
You just have to select all the characters except the final parentheses and their numeric content:
(.+) \(\d+\)
The first pair of parentheses captures the content (here, all content, as declared by the dot). The second pair is escaped (be careful with the backslashes), meaning these parentheses do not capture anything but match the literal parentheses around the "\d+" expression, which is a number.
One of my favorite regex sites: http://www.regular-expressions.info/
Maybe s/([\s\w]+\w)\s*\(\d+\)/\1/?
I don't know Ruby, so you'd have to translate it to its own regexp syntax.
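A possible Ruby translation of the two suggestions, as a sketch only. The method names are invented for the example, and it assumes every input ends in " (num)" as in the question.

```ruby
# First suggestion: capture everything before the " (num)" suffix.
def title_by_capture(text)
  text[/(.+) \(\d+\)/, 1]
end

# Second suggestion: substitute the whole match with its first group.
def title_by_substitution(text)
  text.sub(/([\s\w]+\w)\s*\(\d+\)/, '\1')
end

["Executive Producer (3)", "Producer (0)", "1st Assistant Camera (12)"].each do |line|
  puts title_by_capture(line)
  puts title_by_substitution(line)
end
```

Note that title_by_capture returns nil when the suffix is missing, while title_by_substitution returns the string unchanged.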

Resources