Matching braces in ruby with a character in front - ruby

I have read quite a few posts here for matching nested braces in Ruby using Regexp. However I cannot adapt it to my situation and I am stuck. The Ruby 1.9 book uses the following to match a set of nested braces
/\A(?<brace_expression>{([^{}]|\g<brace_expression>)*})\Z/x
I am trying to alter this in three ways. 1. I want to use parentheses instead of braces, 2. I want a character in front (such as a hash symbol), and 3. I want to match anywhere in the string, not just beginning and end. Here is what I have so far.
/(#(?<brace_expression>\(([^\(\)]|\g<brace_expression>)*\)))/x
Any help in getting the right expression would be appreciated.

Using the regex modifier x enables comments in the regex. So the # in your regex is interpreted as a comment character and the rest of the regex is ignored. You'll need to either escape the # or remove the x modifier.
Btw: There's no need to escape the parentheses inside [].
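A minimal sketch of the fix described above, keeping the x modifier and escaping the hash. The constant name HASH_PARENS, the group name paren, and the sample text are assumptions made for illustration.

```ruby
# \# keeps the hash literal even under the x modifier.
# The named group "paren" recurses into itself via \g<paren>.
HASH_PARENS = /\#(?<paren>\((?:[^()]|\g<paren>)*\))/x

text = 'before #(a (b) c) after #(d)'

# String#[] with a regexp returns the first full match.
first = text[HASH_PARENS]          # => "#(a (b) c)"

# scan returns the captures of each match; here only the "paren" group.
groups = text.scan(HASH_PARENS).flatten  # => ["(a (b) c)", "(d)"]
```

Because the pattern has no \A or \Z anchors, it matches anywhere in the string.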

Related

How can I check for repeated strings with check-tail plugin in Sensu?

I am using sensu and the check-tail.rb plugin to alert if any errors appear in my app logs. The problem is that I want the check to be successful if it finds 3 or more error messages.
The solution that I came up with is using a regex like:
\^.*"status":503,.*$.*^.*"status":503,.*$.*^.*"status":503,.*$\im
But it seems to not work because of the match function: instead of passing the variable as a ruby regex it passes it as a string (this can be seen here).
You need to pass the pattern as a string literal, not as a Regexp object.
Thus, you need to remove the regex delimiters and change the modifiers to their inline option variants, that is, prepend the pattern with (?im).
(?im)\A.*"status":503,.*$.*^.*"status":503,.*$.*^.*"status":503,.*\z
Note that to match the start of string in Ruby, you need to use \A and to match the end of string, you need to use \z anchors.
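To illustrate, here is a small sketch of how a pattern given as a plain string behaves once Ruby converts it to a regexp. The helper three_errors? and the sample log lines are hypothetical; this does not reproduce the plugin's own code.

```ruby
# Pattern passed as a plain string, as a command-line option would be.
# (?im) sets case-insensitive and dot-matches-newline modes inline.
PATTERN = '(?im)\A.*"status":503,.*$.*^.*"status":503,.*$.*^.*"status":503,.*\z'

# Hypothetical helper: String#match converts a String pattern to a Regexp.
def three_errors?(log_text, pattern)
  !log_text.match(pattern).nil?
end

error_line = '{"path":"/api","status":503,"ms":12}'
ok_line    = '{"path":"/api","status":200,"ms":3}'

enough  = three_errors?([error_line, ok_line, error_line, error_line].join("\n"), PATTERN)
too_few = three_errors?([error_line, ok_line, error_line].join("\n"), PATTERN)
# enough  => true
# too_few => false
```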

What is the difference between these three alternative ways to write Ruby regular expressions?

I want to match the path "/". I've tried the following alternatives, and the first two do match, but I don't know why the third doesn't:
/\A\/\z/.match("/") # <MatchData "/">
"/\A\/\z/".match("/") # <MatchData "/">
Regexp.new("/\A\/\z/").match("/") # nil
What's going on here? Why are they different?
The first snippet is the only correct one.
The second example is... misleading. The string literal "/\A\/\z/" is, obviously, not a regex. It's a string. Strings have a #match method, which converts its argument to a regexp (if it is not already one) and matches the string against it. So, in this example, it's '/' that is the regular expression, and it matches a forward slash found in the other string.
The third line is completely broken: you don't need the surrounding slashes there; they are part of the regex literal syntax, which you didn't use. Also, use single-quoted strings, not double-quoted ones (which try to interpret escape sequences like \A).
Regexp.new('\A/\z').match("/") # => #<MatchData "/">
And, of course, none of the above is needed if you just want to check if a string consists of only one forward slash. Just use the equality check in this case.
s == '/'
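The differences can be made visible by inspecting what each expression actually contains. This is only a sketch; the variable names are chosen for illustration.

```ruby
# Double quotes drop the backslashes from \A and \z (unknown escapes),
# so this string is just the five characters /A/z/.
double_quoted = "/\A\/\z/"

# Single quotes keep the backslashes, and no surrounding slashes are used.
anchored = Regexp.new('\A/\z')

literal_hit  = /\A\/\z/.match('/')                 # regex literal, matches
string_hit   = double_quoted.match('/')            # '/' is the pattern here
broken_hit   = Regexp.new(double_quoted).match('/') # looks for "/A/z/", nil
anchored_hit = anchored.match('/')                 # matches
```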

Double escape characters in elisp regex patterns

(regexp-opt '("this" "that"))
returns,
"\\(?:th\\(?:at\\|is\\)\\)"
Why are there double backslashes everywhere in this elisp regex? Doesn't elisp regex use a single backslash?
Also, the ? symbol is a postfix operator in regex patterns, which means it acts upon the expression that precedes it (http://www.gnu.org/software/emacs/manual/html_node/elisp/Regexp-Special.html#Regexp-Special). But here there is no expression before the ? operator. So what does the
(?:th\\
part mean in this regex?
The backslash is part of the regexp syntax. But to preserve it as part of a regexp string, you need to protect it with another backslash, as described in the documentation on string syntax:
'Likewise, you can include a backslash by preceding it with another backslash, like this: "this \\ is a single embedded backslash".'
As for the ?: construct, it's how you specify a non-capturing or "shy" group:
"A shy group serves the first two purposes of an ordinary group (controlling the nesting of other operators), but it does not get a number, so you cannot refer back to its value with ‘\digit’. Shy groups are particularly useful for mechanically-constructed regular expressions, because they can be added automatically without altering the numbering of ordinary, non-shy groups."
It's documented as part of the regexp backslash documentation. As the passage quoted above explains, it's useful in functions like regexp-opt for grouping patterns without creating capture groups.

How to check how many variables (masks) declared in Regexp (ruby)?

Let's say I have a regexp with some arbitrary amount of capturing groups:
pattern = /(some)| ..a lot of masks combined.... |(other)/
Is there any way to determine the number of those groups?
If you can always find a string that matches the regex you are given, then it suffices to match it against the regex and look at the length of the match data. However, determining whether a regexp has a string that it matches is NP-hard[1]. This is only feasible if you know in advance what kind of regexes you'll be getting.
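When a matching sample string is known, the match-data approach can be sketched as follows. The helper name count_groups_by_match and the sample pattern are assumptions for illustration.

```ruby
# Hypothetical helper: counts groups by matching a known-good sample.
# MatchData#size counts the whole match plus every group, including
# groups that did not participate in the match.
def count_groups_by_match(regexp, sample)
  match = regexp.match(sample)
  raise ArgumentError, 'sample does not match the regexp' if match.nil?
  match.size - 1
end

count_groups_by_match(/(some)|(x)(y)|(other)/, 'other')  # => 4
```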
The next best method in the Regexp class is Regexp#source or Regexp#to_s. However, we need to parse the regex ourselves if we do this.
I can't speak for the future, but as of Ruby 2.0, there is no better method in the Regexp core class.
A left parenthesis denotes a literal left parenthesis, if preceded by an unescaped backslash. A backslash is unescaped unless an unescaped backslash precedes. So, a character is escaped iff preceded by an odd number of backslashes.
An unescaped left parenthesis denotes a capturing group iff not followed by a question mark. With a question mark, it can mean various things: (?'name') and (?<name>) denote a named capturing group. Named and unnamed capturing groups cannot coexist in the same regex, however[2]. (?:) denote non-capturing groups. This is a special case of (?flags-flags:). (?>) denote atomic groups. (?=), (?!), (?<=) and (?<!) denote lookaround. (?#) denote comments.
Ruby's regexp engine supports comments in regexes. Handling them in the main regex would be very difficult. We can try to strip them if we really want to support these, but supporting them fully will get messy due to the possibility of inline flags turning extended mode (and thus line comments) on and off in ways that a regular expression cannot capture. I will go ahead and not support unescaped parentheses in regex comments[3].
We want to count:
the number of left parentheses \(
that are not escaped by a backslash (?<!(?<!\\)(?:\\\\)*\\) (read: not preceded by an odd number of backslashes that are not preceded by yet another backslash) and
that are not followed by a question mark (?!\?)
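Ruby doesn't support unbounded lookbehind, but if we reverse the source first, the first assertion can be rewritten as a lookahead: (?!(?:\\\\)*\\(?!\\)) (read: not followed by an odd number of backslashes). The second assertion becomes a lookbehind: (?<!\?).
The whole solution:
def count_groups(regexp)
  # Named capture support (unnamed groups do not capture when named ones exist):
  # named_count = regexp.named_captures.count
  # return named_count if named_count > 0
  # Main: scan the reversed source, so "preceded by" becomes "followed by".
  # Count each "(" that is not preceded by "?" (a "(?" construct in the
  # original) and not followed by an odd run of backslashes (escaped).
  test = /(?<!\?)\((?!(?:\\\\)*\\(?!\\))/
  regexp.source.reverse.scan(test).count
end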
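A quick sanity check of the reversed-source approach, with the function restated so the snippet runs on its own. The sample patterns are assumptions, and parentheses inside character classes or comments are not handled, as noted above.

```ruby
def count_groups(regexp)
  # In the reversed source, "(?" appears as "?(" and escaping backslashes
  # follow the parenthesis instead of preceding it.
  test = /(?<!\?)\((?!(?:\\\\)*\\(?!\\))/
  regexp.source.reverse.scan(test).count
end

count_groups(/(some)|(other)/)   # => 2
count_groups(/(?:a)(b)\((c)/)    # => 2, "(?:" and "\(" are skipped
count_groups(/\\(a)/)            # => 1, the backslash itself is escaped
```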
[1]: we can show the NP-hardness by converting the satisfiability problem to it:
AND: xy (x must be an assertion)
OR: x|y
NOT: (?!x)
atoms: (?=1), (?=.1), (?=..1), ..., (?!1), (?!.1)...
example(XOR): /^(?:(?=1)(?!.1)|(?!1)(?=.1))..$/
This extends to NP-completeness for any class of regexes that can be tested in polynomial time. This includes any regex with no nested repetition (or repeated backreferences to repetition or recursion) and with bounded nesting depth of optional matches.
[2]: /((?<name>..)..)../.match('abcdef').to_a returns ['abcdef', 'ab'], indicating that unnamed capturing groups are ignored when named capturing groups are present. Tested in Ruby 1.9.3
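The behaviour described in this footnote can be reproduced with a short sketch; the variable names are illustrative.

```ruby
mixed   = /((?<name>..)..)../   # one named group inside an unnamed one
unnamed = /((..)..)../          # same shape, no names

mixed_result   = mixed.match('abcdef').to_a    # => ["abcdef", "ab"]
unnamed_result = unnamed.match('abcdef').to_a  # => ["abcdef", "abcd", "ab"]
group_names    = mixed.names                   # => ["name"]
```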
[3]: Inline comments start with (?# and end with ). They cannot contain an unescaped right parenthesis, but they can contain an unescaped left parenthesis. They can be stripped easily (even though we have to sprinkle the "unescaped" regex everywhere), so they are the lesser evil, but they are also less likely to contain an unescaped left parenthesis.
Line comments start with # and end with a newline. These are only treated as comments in the extended mode. Outside the extended mode, they match the literal # and newline. This is still easy, even if we have to consider escaping again. Determining if the regex has the extended flag set is not too difficult, but the flag modifier groups are a different beast entirely.
Even with Ruby's awesome recursive regexes, merely determining if a previously-open group modifying the extended mode is already closed would yield a very nasty regex (even if you replace one by one and don't have to skip comments, you have to account for escaping). It wouldn't be pretty (even with interpolation) and it wouldn't be fast.

How to use RegEx to replace items based on their context, without affecting the context

Using Ruby, I am writing a regular expression, and I need to be able to remove any colon that appears between parentheses. I understand that I can use
"This is a (string :)".sub!(/\([^\)]*:/, '')
to do this, but the problem is that this function will also remove the context along with it. Is there any way to specify that I only want it to remove the colon and not the entire matching expression?
So some regular expression engines support what are called look-ahead and look-behind assertions, which match without consuming characters. Ruby 1.8 supports look-ahead but not look-behind (Ruby 1.9 and later also support look-behind, though only for patterns of bounded length). With look-ahead alone you can quite easily stick with sub and remove a colon that precedes a closing parenthesis, but without ensuring that it comes after an opening parenthesis:
string = 'This is a (string :)'
string.sub /:(?=\))/, ''
# => 'This is a (string )'
The alternative would be to use subpattern capturing (which happens automatically when you use grouping in your regular expression) to rebuild the string without the undesirable portion, in this case the colon:
string.sub /(\([^:]+):\)/, '\1)'
The \1 is a back-reference to what is matched in the first group, which is delimited by the parentheses that are not escaped. You can see here I didn't bother capturing the closing parenthesis in a second group, opting instead simply to include it in the substitution. This works well in this case because it will not change, but if you don't know that the colon will appear at the end of the parentheses-enclosed content, you would need a second group:
string.sub /(\([^:]+):([^)]+\))/, '\1\2'
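Both capture-group variants can be tried side by side. This is only a sketch; the second sample string is an assumption added to show a colon in the middle of the parenthesized text.

```ruby
string = 'This is a (string :)'

# One group: keep everything up to the colon, re-add the ")" literally.
trailing = string.sub(/(\([^:]+):\)/, '\1)')
# => "This is a (string )"

# Two groups: keep the text on both sides of the colon.
middle = 'a (foo:bar) b'.sub(/(\([^:]+):([^)]+\))/, '\1\2')
# => "a (foobar) b"
```

Note that the two-group pattern needs at least one character between the colon and the closing parenthesis, so it leaves 'This is a (string :)' unchanged.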
The prior answer will mostly work for deleting single colons within paren groups, but has trouble with multiples like '(thing:foo:bar)'. It would be nice to use lookbehind and lookahead to make the within-parens assertion, but Ruby (like most regexp engines) doesn't support non-deterministic length patterns in lookbehind.
irb> s = 'x (a:b:c) : (1:2:3) y'
=> "x (a:b:c) : (1:2:3) y"
irb> s.gsub /(?<=\([^\(]*):(?=[^\)]*\))/, ''
SyntaxError: (irb):10: invalid pattern in look-behind: /(?<=\([^\(]*):(?=[^\)]*\))/
from /Users/dbenhur/.rbenv/versions/1.9.2-wp/bin/irb:12:in `<main>'
You could instead use the block form of gsub to capture paren enclosed groups, then remove colons from each match:
irb> s.gsub(/\([^\)]*\)/) {|m| m.delete ':'}
=> "x (abc) : (123) y"
In regex in general, you can use '(\()(:)(\))' with the replacement \1\3.
I'm not familiar with Ruby. Basically, you have three groups, and from these three groups ( : and ) you get rid of the second one, the :.
I tested it in Notepad++ and it works.
I think this is called: regex backreference
Cheers.
If you can assume all parentheses will come in balanced pairs like they do in your example, this should be all you need:
"This is a (string :)".gsub!(/:(?=[^()]*\))/, '')
If the lookahead succeeds in finding a closing paren without seeing an opening paren first, the colon must be inside a (...) sequence. Notice how I excluded the opening paren as well as the closing paren; that's essential.
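For comparison, here is a sketch that runs the lookahead approach and the block form of gsub on the same input. The unbalanced sample is an assumption added to show where the two differ.

```ruby
s = 'x (a:b:c) : (1:2:3) y'

# Lookahead: a colon counts as "inside" when a ")" comes before any "(".
by_lookahead = s.gsub(/:(?=[^()]*\))/, '')
# => "x (abc) : (123) y"

# Block form: match each (...) group, then delete the colons within it.
by_block = s.gsub(/\([^)]*\)/) { |group| group.delete(':') }
# => "x (abc) : (123) y"

# With an unbalanced ")", only the lookahead version removes the colon.
unbalanced = 'a : b)'
lookahead_unbalanced = unbalanced.gsub(/:(?=[^()]*\))/, '')
block_unbalanced     = unbalanced.gsub(/\([^)]*\)/) { |group| group.delete(':') }
```

On balanced input the two agree; the lookahead version relies on the balanced-parentheses assumption stated above.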
