Ensure non-matching of a pattern within a scope - ruby

I am trying to create a regex that matches a pattern in some part of a string, but not in another part of the string.
I am trying to match a substring that
(i) is surrounded by a balanced pair of one or more consecutive backticks `
(ii) and does not include as many consecutive backticks as in the surrounding patterns
(iii) where the surrounding patterns (sequence of backticks) are not adjacent to other backticks.
This is a variant of Markdown's inline code syntax.
Examples of matches are as follows:
"xxx`foo`yyy" # => matches "foo"
"xxx``foo`bar`baz``yyy" # => matches "foo`bar`baz"
"xxx```foo``bar``baz```yyy" # => matches "foo``bar``baz"
One regex to achieve this is:
/(?<!`)(?<backticks>`+)(?<inline>.+?)\k<backticks>(?!`)/
which uses a non-greedy match.
I was wondering if I can get rid of the non-greedy match.
The idea comes from when the prohibited pattern is a single character. When I want to match a substring that is surrounded by a single quote ' that does not include a single quote in it, I can do either:
/'.+?'/
/'[^']+'/
The first one uses non-greedy match, and the second one uses an explicit non-matching pattern [^'].
I am wondering if it is possible to have something like the second form when the prohibited pattern is not a single character.
Going back to the original issue, there is the negative lookahead syntax (?!...), but I cannot restrict its effective scope. If I make my regex like this:
/(?<!`)(?<backticks>`+)(?<inline>(?!.*\k<backticks>).*)\k<backticks>(?!`)/
then the effect of (?!.*\k<backticks>) will not be limited to within (?<inline>...), but will extend to the rest of the string. And since that contradicts the \k<backticks> at the end, the regex fails to match.
Is there a regex technique to ensure non-matching of a pattern (not-necessarily a single character) within a certain scope?

You can match one or more characters, checking each one with a negative lookahead so that it is not the start of the closing delimiter:
/(?<!`)(?<backticks>`+)(?<inline>(?:(?!\k<backticks>).)+)\k<backticks>(?!`)/
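To see how the per-character lookahead behaves, here is a minimal sketch that runs the regex above against the three examples from the question. The constant name INLINE_CODE and the variable names are made up for illustration.

```ruby
# Each "." is consumed only at positions where the opening backtick run
# does not start again, so the lookahead's scope is a single character.
INLINE_CODE = /(?<!`)(?<backticks>`+)(?<inline>(?:(?!\k<backticks>).)+)\k<backticks>(?!`)/

samples = [
  "xxx`foo`yyy",
  "xxx``foo`bar`baz``yyy",
  "xxx```foo``bar``baz```yyy"
]

results = samples.map do |s|
  m = INLINE_CODE.match(s)
  m && m[:inline]
end

results.each { |r| puts r }
# foo
# foo`bar`baz
# foo``bar``baz
```

Because the lookahead is re-evaluated before every character, the prohibition applies only inside the inline group and never reaches past the closing delimiter.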

Related

regexp match group with the exception of a member of the group

So, there are a number of regular expressions that match a particular group of characters, like the following:
/./ - Any character except a newline.
/./m - Any character (the m modifier enables multiline mode)
/\w/ - A word character ([a-zA-Z0-9_])
/\s/ - Any whitespace character
And in ruby:
/[[:punct:]]/ - Punctuation character
/[[:space:]]/ - Whitespace character ([:blank:], newline, carriage return, etc.)
/[[:upper:]]/ - Uppercase alphabetical
So, here is my question: how do I get a regexp to match a group like this, but exempt a character out?
Examples:
match all punctuations apart from the question mark
match all whitespace characters apart from the new line
match all words apart from "go"... etc
Thanks.
You can use character class subtraction.
Rexegg:
The syntax […&&[…]] allows you to use a logical AND on several character classes to ensure that a character is present in them all. Intersecting with a negated character, as in […&&[^…]] allows you to subtract that class from the original class.
Consider this code:
s = "./?!"
res = s.scan(/[[:punct:]&&[^!]]/)
puts res
Output is only ., / and ? since ! is excluded.
Restricting with a lookahead (as sawa has written just now) is also possible, but is not required when you have this subtraction supported. When you need to restrict some longer values (more than 1 character) a lookahead is required.
In many cases, a lookahead must be anchored to a word boundary to return correct results. As an example of using a lookahead to restrict punctuation (single character matching generic pattern):
/(?:(?!!)[[:punct:]])+/
This will match one or more punctuation symbols other than !.
puts "./?!".scan(/(?:(?!!)[[:punct:]])+/) will output ./?
Use character class subtraction whenever you need to exclude single characters; it is more efficient than using lookaheads.
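The second scenario from the question (whitespace apart from the newline) can be sketched both ways. The sample string and variable names are only illustrative.

```ruby
text = "a b\tc\nd"

# Subtraction: whitespace intersected with "anything but a newline".
by_subtraction = text.scan(/[\s&&[^\n]]/)

# Lookahead: refuse a newline at this position, then take one whitespace character.
by_lookahead = text.scan(/(?!\n)\s/)

p by_subtraction  # => [" ", "\t"]
p by_lookahead    # => [" ", "\t"]
```

Both forms return the same matches here; the subtraction form keeps the restriction inside the character class itself.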
So, the 3rd scenario regex must look like:
/\b(?!go\b)\w+\b/
        ^^
If you write /(?!\bgo\b)\b\w+\b/, the regex engine will evaluate the lookahead at each position in the input string. If you put a \b at the beginning, the lookahead is evaluated only at word boundary positions, and the pattern will perform better. Also note that the \b marked with ^^ is very important, since it makes the regex engine check for the whole word go. If you remove it, the pattern will also reject every word that starts with go.
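A small sketch of the difference the inner \b makes, using a made-up sample sentence:

```ruby
text = "go going ago go"

# With \b inside the lookahead, only the whole word "go" is rejected.
whole_word = text.scan(/\b(?!go\b)\w+\b/)

# Without it, every word that merely starts with "go" is rejected too.
prefix_only = text.scan(/\b(?!go)\w+\b/)

p whole_word   # => ["going", "ago"]
p prefix_only  # => ["ago"]
```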
Put what you want to exclude inside a negative lookahead in front of the match. For example,
To match all punctuations apart from the question mark,
/(?!\?)[[:punct:]]/
To match all words apart from "go",
/(?!\bgo\b)\b\w+\b/
This is a general approach that is sometimes useful:
a = []
".?!,:;-".scan(/[[:punct:]]/) { |s| a << s unless s == '?' }
a #=> [".", "!", ",", ":", ";", "-"]
The content of the block is limited only by your imagination.

preg_match search pattern, stop at character combination

I am trying to pull a whole MySQL statement from a database SQL file
INSERT INTO `helppages`
(`HelpPageID`, `ShowHelpItem`, `HelpRank`, `HelpCategory`, `HelpTitle`, `HelpDescription`, `HelpLink`, `HelpText`, `CMSHelpBar`, `CMSHelpBarAdditional`)
VALUES (... characters (Too many to post here, but the expression below grabs all) ...
);
The expression I am currently using (I have been through many variations) is:
preg_match("#INSERT INTO `$SearchingTableName` ([!%&'-/:<=>#^`\;\s\d\w\"\#\$\(\)\*\+\,\.\?\[\]\{\}\(\)\\\|©]*?)\)\;\r\n#s", $uploadedfile, $matches);
which gets all the information, but I can't get it to stop at the final ");\r\n"
also $SearchingTableName = helppages.
Edit
Sorry, the current expression uses a lookahead:
preg_match("#INSERT INTO `$SearchingTableName` ([!%&'-/:<=>#^`\;\s\d\w\"\#\$\(\)\*\+\,\.\?\[\]\{\}\(\)\\\|©]*)(?!\)\;\r\n)#s", $uploadedfile, $matches);
Also I checked with MSword using );^p and there is only one instance at the end of the Insert
You can't match this kind of string just by playing with character classes. You need to describe the structure of the string.
For this simple particular case you can use this pattern:
// The heredoc processes escapes like a double-quoted string, so a
// regex-level backslash (\\) has to be written as \\\\ below.
$pattern = <<<EOD
~
# definitions
(?(DEFINE)
    (?<elt> [^"',)]+ | '(?>[^\\\\']+|\\\\.)*' | "(?>[^\\\\"]+|\\\\.)*" )
    (?<list> \( \g<elt>? (?: \s* , \s* \g<elt> )* \) )
)
# main pattern
INSERT \s+ (?:INTO \s+)? `$SearchingTableName` \s* \g<list>? \s* VALUES \s*
\g<list> \s* (?: , \s* \g<list> \s* )* ;
~xs
EOD;
if (preg_match_all($pattern, $uploadedfile, $m)) {
    print_r($m[0]);
}
But keep in mind that parsing a programming language is not an easy task and is full of traps (depending on the syntax), even with the capabilities of the PHP regex engine. (It is possible, however.)
regex features used here:
delimiters and modifiers:
The pattern delimiter used here is ~ instead of the classical /. There is no literal ~ in the pattern thus it's ok.
The pattern uses two modifiers: s and x:
By default, the . can't match the newline character \n. The s modifier (s for singleline mode) changes this behavior: when it is used, the . can match all characters, including the newline. (Note that you can get the default behavior back with \N, which doesn't match the newline character whatever the mode.)
x switches on the extended mode. In this mode, whitespace inside the pattern is ignored. This mode also allows inline comments that begin with a # character. It is very useful for making long patterns readable using spaces, indentation and comments.
using named captures
When you have a long pattern and when you need to reuse several times the same subpatterns, you have the possibility to reuse subpatterns that are written inside capture groups.
A quick example:
You want to match several items separated by commas and composed with 4 digits and 4 letters like this: 1234abcd,5678efgh,9012ijkl,3456mnop.
The pattern to do that is obviously ^\d{4}[a-z]{4}(?:,\d{4}[a-z]{4})+$
But if I don't want to write \d{4}[a-z]{4} two times, I can put it in a capture group and use an alias for the subpattern in the capture group, like this: ^(\d{4}[a-z]{4})(?:,(?1))+$.
Here the (?1) is an alias for the subpattern inside the capture group 1 (not the content matched by the subpattern as a backreference \1 does, but the subpattern itself) that is \d{4}[a-z]{4}.
PCRE, the regex engine used by PHP, also supports the syntax \g<1> in place of (?1).
But if you have a lot of capture groups in the pattern, it is not always handy to remember what's the number of the capture group you need. This is the reason why you have the possibility to name capturing groups. Example: ^(?<diglet>\d{4}[a-z]{4})(?:,\g<diglet>)+$
The other advantage of named patterns, besides making the whole pattern more readable, is that they add a semantic dimension to the pattern, in the same way you can by adding an id attribute to an HTML tag.
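Since the rest of this page is about Ruby, here is a hedged sketch of the same named-subpattern call in Ruby, which also accepts the \g<name> form. The constant ITEMS and the sample strings are made up for illustration, and \A and \z are used as string anchors.

```ruby
# \g<diglet> re-runs the subpattern of the named group; it is not a backreference
# to the text that group matched.
ITEMS = /\A(?<diglet>\d{4}[a-z]{4})(?:,\g<diglet>)+\z/

full_list  = ITEMS.match?("1234abcd,5678efgh,9012ijkl,3456mnop")
short_item = ITEMS.match?("1234abcd,5678efg")
same_text  = ITEMS.match?("1234abcd,1234abcd")

p full_list   # => true
p short_item  # => false
p same_text   # => true
```

The first case succeeds even though every item has different text, which is the difference between calling a subpattern and backreferencing its match.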
definition section
Instead of defining the named subpattern directly in the main pattern as in the previous example, you can use a definition section to hold all the subpatterns that will be used in the main pattern. Note that everything inside this section is there only for definition purposes and doesn't match anything. It's like a zero-width assertion.
The syntax of this section is: (?(DEFINE)(?<diglet>\d{4}[a-z]{4})) (you can put several named subpatterns inside). The preceding pattern becomes: (?(DEFINE)(?<diglet>\d{4}[a-z]{4}))^\g<diglet>(?:,\g<diglet>)+$
the pattern itself:
The first part of the pattern enclosed between (?(DEFINE) and ) consists of subpatterns definitions that will be used later in the main pattern.
The elt subpattern describes an item (a column name or a value):
[^"',)]+ # all that is not a quote, a comma or a closing parenthesis:
# in the present context this will match numbers and column names
| # OR
'(?>[^\\']+|\\.)*' # string between single quotes (designed to deal with escaped quotes)
|
"(?>[^\\"]+|\\.)*" # same for double quotes
The list subpattern describes the full list of elements separated by commas between parentheses. Note that this subpattern uses a reference to the elt subpattern.
The main pattern needs only to reuse the subpattern list.

Precedence of Ruby regular expressions?

I am reviewing regular expressions and cannot understand why a regular expression won't match a given string, specifically:
regex = /(ab*)+(bc)?/
mystring = "abbc"
The match matches "abb" but leaves the c off. I tested this using Rubular and in IRB and don't understand why the regex doesn't match the entire string. I thought that (ab*)+ would match "ab" and then (bc)? would match "bc".
Am I missing something in terms of precedence for regular expression operations?
Regular expressions try to match the first part of the regular expression as much as possible by default, and they do not backtrack to try to make larger sections match if they don't have to. Since you make (bc) optional, the (ab*) can match as much as it wants (the non-zero repetition after it doesn't have much to do) and doesn't try backtracking to try other matching alternatives.
If you want the whole string to be matched (which will force some backtracking in this case) make sure you anchor both ends of the string:
regex = /^(ab*)+(bc)?$/
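A quick way to see the difference is to compare the match data of both versions. This is a minimal sketch; the variable names are illustrative.

```ruby
str = "abbc"

unanchored = /(ab*)+(bc)?/.match(str)
anchored   = /^(ab*)+(bc)?$/.match(str)

# Without anchors, the greedy b* keeps both b's and the optional group is skipped.
p unanchored.to_a  # => ["abb", "abb", nil]

# With anchors, the engine has to backtrack and give one b back to (bc).
p anchored.to_a    # => ["abbc", "ab", "bc"]
```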
The regex with parentheses assumes you have two matches in your string.
The first one is abb, because (ab*) means a followed by zero or more b. You have two b, so the match is abb. Then only c is left in your string, so it doesn't match the second group, which is bc.

How to check how many variables (masks) declared in Regexp (ruby)?

Let's say I have a regexp with some arbitrary amount of capturing groups:
pattern = /(some)| ..a lot of masks combined.... |(other)/
Is there any way to determine a number of that groups?
If you can always find a string that matches the regex you are given, then it suffices to match it against the regex and look at the length of the match data. However, determining whether a regexp has a string that it matches is NP-hard[1]. This is only feasible if you know in advance what kind of regexes you'll be getting.
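When a matching string is known, the match-data approach can be sketched as follows. The pattern and input are made up for illustration.

```ruby
pattern = /(some)|(?:ignored)|(other)/
m = pattern.match("some")

# MatchData#size counts the whole match plus every capturing group,
# including groups that did not participate in the match.
group_count = m.size - 1

p m.captures   # => ["some", nil]
p group_count  # => 2
```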
The next best method in the Regexp class is Regexp#source or Regexp#to_s. However, we need to parse the regex if we do this.
I can't speak for the future, but as of Ruby 2.0, there is no better method in the Regexp core class.
A left parenthesis denotes a literal left parenthesis, if preceded by an unescaped backslash. A backslash is unescaped unless an unescaped backslash precedes. So, a character is escaped iff preceded by an odd number of backslashes.
An unescaped left parenthesis denotes a capturing group iff not followed by a question mark. With a question mark, it can mean various things: (?'name') and (?<name>) denote a named capturing group. Named and unnamed capturing groups cannot coexist in the same regex, however[2]. (?:) denotes a non-capturing group; this is a special case of (?flags-flags:). (?>) denotes an atomic group. (?=), (?!), (?<=) and (?<!) denote lookarounds. (?#) denotes a comment.
Ruby regexp engine supports comments in regexes. Considering them in the main regex would be very difficult. We can try to strip them if we really want to support these, but supporting them fully will get messy due to the possibility of inline flags turning extended mode (and thus line comments) on and off in ways that a regular expression cannot capture. I will go ahead and not support unescaped parentheses in regex comments[3].
We want to count:
the number of left parentheses \(
that are not escaped by a backslash (?<!(?<!\\)(?:\\\\)*\\) (read: not preceded by an odd number of backslashes that are not preceded by yet another backslash) and
that are not followed by a question mark (?!\?)
Ruby doesn't support unbounded lookbehind, but if we reverse the source first, the first assertion becomes a lookahead: (?!(?:\\\\)*\\(?!\\)) (read: not followed by an odd number of backslashes). The second assertion becomes a lookbehind: (?<!\?).
the whole solution
def count_groups(regexp)
  # named capture support:
  # named_count = regexp.named_captures.count
  # return named_count if named_count > 0
  # main: scan the reversed source, so that the escape check is a lookahead
  test = /(?<!\?)\((?!(?:\\\\)*\\(?!\\))/
  regexp.source.reverse.scan(test).count
end
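As a sanity check of the counting rule described above, here is a self-contained sketch applied to a few hand-picked regexes. The method name count_groups_sketch is hypothetical. It also inherits the stated limitations: comments are not handled, and it assumes there are no parentheses inside character classes.

```ruby
# Counts "(" in the reversed source that are not preceded by "?" (so not
# followed by "?" in the original) and not followed by an odd run of
# backslashes (so not escaped in the original).
def count_groups_sketch(regexp)
  reversed = regexp.source.reverse
  reversed.scan(/(?<!\?)\((?!(?:\\\\)*\\(?!\\))/).count
end

plain   = count_groups_sketch(/(a)(b)/)         # two capturing groups
mixed   = count_groups_sketch(/(a)\((?:b)(c)/)  # escaped and non-capturing parens are skipped
escaped = count_groups_sketch(/\\(a)/)          # an escaped backslash does not escape the paren

p plain    # => 2
p mixed    # => 2
p escaped  # => 1
```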
[1]: we can show the NP-hardness by converting the satisfiability problem to it:
AND: xy (x must be an assertion)
OR: x|y
NOT: (?!x)
atoms: (?=1), (?=.1), (?=..1), ..., (?!1), (?!.1)...
example(XOR): /^(?:(?=1)(?!.1)|(?!1)(?=.1))..$/
this extends to NP-completeness for any class of regexes that can be tested in polynomial time. This includes any regex with no nested repetition (or repeated backreferences to repetition or recursion) and with bounded nesting depth of optional matches.
[2]: /((?<name>..)..)../.match('abcdef').to_a returns ['abcdef', 'ab'], indicating that unnamed capturing groups are ignored when named capturing groups are present. Tested in Ruby 1.9.3
[3]: Inline comments start with (?# and end with ). They cannot contain an unescaped right parenthesis, but they can contain an unescaped left parenthesis. These can be stripped easily (even though we have to sprinkle the "unescaped" regex everywhere) and are the lesser evil, but they're also less likely to contain an unescaped left parenthesis.
Line comments start with # and end with a newline. These are only treated as comments in the extended mode. Outside the extended mode, they match the literal # and newline. This is still easy, even if we have to consider escaping again. Determining if the regex has the extended flag set is not too difficult, but the flag modifier groups are a different beast entirely.
Even with Ruby's awesome recursive regexes, merely determining if a previously-open group modifying the extended mode is already closed would yield a very nasty regex (even if you replace one by one and don't have to skip comments, you have to account for escaping). It wouldn't be pretty (even with interpolation) and it wouldn't be fast.

How to use RegEx to replace items based on their context, without affecting the context

Using Ruby, I am writing a regular expression, and I need to be able to remove any colon that appears between parentheses. I understand that I can use
"This is a (string :)".sub!(/\([^\)]*:/, '')
to do this, but the problem is that this function will also remove the context along with it. Is there any way to specify that I only want it to remove the colon and not the entire matching expression?
Some regular expression engines support look-ahead and look-behind assertions, which match without consuming characters. Ruby 1.8 supports look-ahead but not look-behind (Ruby 1.9 and later add look-behind, but not for variable-length patterns such as [^(]*), which means you could quite easily stick with sub and remove a colon that precedes a closing parenthesis, though without ensuring it comes after an opening parenthesis:
string = 'This is a (string :)'
string.sub /:(?=\))/, ''
# => 'This is a (string )'
The alternative would be to use subpattern capturing (which happens automatically when you use grouping in your regular expression) to rebuild the string without the undesirable portion, in this case the colon:
string.sub /(\([^:]+):\)/, '\1)'
The \1 is a back-reference to what is matched in the first group, which is delimited by the parentheses that are not escaped. You can see here I didn't bother capturing the closing parenthesis in a second group, opting instead simply to include it in the substitution. This works well in this case because it will not change, but if you don't know that the colon will appear at the end of the parentheses-enclosed content, you would need a second group:
string.sub /(\([^:]+):([^)]*\))/, '\1\2'
The prior answer will mostly work for deleting single colons within paren groups, but has trouble with multiples like '(thing:foo:bar)'. It would be nice to use lookbehind and lookahead to assert that the colon is within parens, but Ruby (like most regexp engines) doesn't support variable-length patterns in lookbehind.
irb> s = 'x (a:b:c) : (1:2:3) y'
=> "x (a:b:c) : (1:2:3) y"
irb> s.gsub /(?<=\([^\(]*):(?=[^\)]*\))/, ''
SyntaxError: (irb):10: invalid pattern in look-behind: /(?<=\([^\(]*):(?=[^\)]*\))/
from /Users/dbenhur/.rbenv/versions/1.9.2-wp/bin/irb:12:in `<main>'
You could instead use the block form of gsub to capture paren enclosed groups, then remove colons from each match:
irb> s.gsub(/\([^\)]*\)/) {|m| m.delete ':'}
=> "x (abc) : (123) y"
In regex in general, you can match with '(\()(:)(\))' and replace with \1\3.
I'm not familiar with Ruby. Basically, you have three groups, (, : and ), and you drop the second one, the :.
I tested it in Notepad++ and it works.
I think this is called a regex backreference.
Cheers.
If you can assume all parentheses will come in balanced pairs like they do in your example, this should be all you need:
"This is a (string :)".gsub!(/:(?=[^()]*\))/, '')
If the lookahead succeeds in finding a closing paren without seeing an opening paren first, the colon must be inside a (...) sequence. Notice how I excluded the opening paren as well as the closing paren; that's essential.
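Here is a short sketch of the same lookahead applied to the multi-colon string from the earlier answer, under the same assumption that the parentheses are balanced and not nested:

```ruby
s = 'x (a:b:c) : (1:2:3) y'

# A colon is removed only if a ")" can be reached without crossing "(" or ")".
cleaned = s.gsub(/:(?=[^()]*\))/, '')

p cleaned  # => "x (abc) : (123) y"
```

The colon between the two groups survives because the lookahead runs into the next opening paren before it finds a closing one.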
