meaning of a `+` following a `*`, when the latter is used as a quantifier in a regular expression - ruby

Today I came across the following regular expression and wanted to know what Ruby would do with it:
> "#a" =~ /^[\W].*+$/
=> 0
> "1a" =~ /^[\W].*+$/
=> nil
In this instance, Ruby seems to be ignoring the + character. If that is incorrect, I'm not sure what it is doing with it. I'm guessing it's not being interpreted as a quantifier, since the * is not escaped and is being used as a quantifier. In Perl/Ruby regexes, sometimes when a character (e.g., -) is used in a context in which it cannot be interpreted as a special character, it is treated as a literal. But if that was happening in this case, I would expect the first match to fail, since there is no + in the lvalue string.
Is this a subtly correct use of the + character? Is the above behavior a bug? Am I missing something obvious?

Well, you can certainly use a + after a *. You can read a bit about it on this site. The + after the * is called a possessive quantifier.
What it does? It prevents * from backtracking.
Ordinarily, when you have something like .*c and using this to match abcde, the .* will first match the whole string (abcde) and since the regex cannot match c after the .*, the engine will go back one character at a time to check if there is a match (this is backtracking).
Once it has backtracked to c, you will get the match abc from abcde.
Now, imagine that the engine has to backtrack a few hundred characters, and if you have nested groups and multiple * (or + or the {m,n} form), you can quickly end up with thousands, millions of characters to backtrack, called catastrophic backtracking.
This is where possessive quantifiers come in handy. They actually prevent any form of backtracking. In the above regex I mentioned, abcde will not be matched by .*+c. Once .*+ has consumed the whole string, it cannot backtrack and since there's no c at the end of the string, the match fails.
So, another possible use of possessive quantifiers is that they can improve the performance of some regexes, provided the engine can support it.
For your regex /^[\W].*+$/, I don't think that there's any improvement (maybe a tiny little improvement) that the possessive quantifier provides though. And last, it might easily be rewritten as /^\W.*+$/.

Related

Generating a character class

I'm trying to censor letters in a word with word.gsub(/[^#{guesses}]/i, '-'), where word and guesses are strings.
When guesses is "", I get this error RegexpError: empty char-class: /[^]/i. I could sort such cases with an if/else statement, but can I add something to the regex to make it work in one line?
Since you are only matching (or not matching) letters, you can add a non-letter character to your regex, e.g. # or %:
word.gsub(/[^%#{guesses}]/i, '-')
See IDEONE demo
If #{guesses} is empty, the regex will still be valid, and since % does not appear in a word, there is no risk of censuring some guessed percentage sign.
You have two options. One is to avoid testing if your matches are empty, that is:
unless (guesses.empty?)
word.gsub(/^#{Regex.escape(guesses)}/i, '-')
end
Although that's not your intention, it's really the safest plan here and is the most clear in terms of code.
Or you could use the tr function instead, though only for non-empty strings, so this could be substituted inside the unless block:
word.tr('^' + guesses.downcase + guesses.upcase, '-')
Generally tr performs better than gsub if used frequently. It also doesn't require any special escaping.
Edit: Added a note about tr not working on empty strings.
Since tr treats ^ as a special case on empty strings, you can use an embedded ternary, but that ends up confusing what's going on considerably:
word.tr(guesses.empty? ? '' : ('^' + guesses.downcase + guesses.upcase), '-')
This may look somewhat similar to tadman's answer.
Probably you should keep the string that represents what you want to hide, instead of what you want to show. Let's say this is remains. Then, it would be easy as:
word.tr(remains.upcase + remains.downcase, "-")

Ruby Koans - Why are repetition operators called "greedy" [duplicate]

What are these two terms in an understandable way?
Greedy will consume as much as possible. From http://www.regular-expressions.info/repeat.html we see the example of trying to match HTML tags with <.+>. Suppose you have the following:
<em>Hello World</em>
You may think that <.+> (. means any non newline character and + means one or more) would only match the <em> and the </em>, when in reality it will be very greedy, and go from the first < to the last >. This means it will match <em>Hello World</em> instead of what you wanted.
Making it lazy (<.+?>) will prevent this. By adding the ? after the +, we tell it to repeat as few times as possible, so the first > it comes across, is where we want to stop the matching.
I'd encourage you to download RegExr, a great tool that will help you explore Regular Expressions - I use it all the time.
'Greedy' means match longest possible string.
'Lazy' means match shortest possible string.
For example, the greedy h.+l matches 'hell' in 'hello' but the lazy h.+?l matches 'hel'.
Greedy quantifier
Lazy quantifier
Description
*
*?
Star Quantifier: 0 or more
+
+?
Plus Quantifier: 1 or more
?
??
Optional Quantifier: 0 or 1
{n}
{n}?
Quantifier: exactly n
{n,}
{n,}?
Quantifier: n or more
{n,m}
{n,m}?
Quantifier: between n and m
Add a ? to a quantifier to make it ungreedy i.e lazy.
Example:
test string : stackoverflow
greedy reg expression : s.*o output: stackoverflow
lazy reg expression : s.*?o output: stackoverflow
Greedy means your expression will match as large a group as possible, lazy means it will match the smallest group possible. For this string:
abcdefghijklmc
and this expression:
a.*c
A greedy match will match the whole string, and a lazy match will match just the first abc.
As far as I know, most regex engine is greedy by default. Add a question mark at the end of quantifier will enable lazy match.
As #Andre S mentioned in comment.
Greedy: Keep searching until condition is not satisfied.
Lazy: Stop searching once condition is satisfied.
Refer to the example below for what is greedy and what is lazy.
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Test {
public static void main(String args[]){
String money = "100000000999";
String greedyRegex = "100(0*)";
Pattern pattern = Pattern.compile(greedyRegex);
Matcher matcher = pattern.matcher(money);
while(matcher.find()){
System.out.println("I'm greedy and I want " + matcher.group() + " dollars. This is the most I can get.");
}
String lazyRegex = "100(0*?)";
pattern = Pattern.compile(lazyRegex);
matcher = pattern.matcher(money);
while(matcher.find()){
System.out.println("I'm too lazy to get so much money, only " + matcher.group() + " dollars is enough for me");
}
}
}
The result is:
I'm greedy and I want 100000000 dollars. This is the most I can get.
I'm too lazy to get so much money, only 100 dollars is enough for me
Taken From www.regular-expressions.info
Greediness: Greedy quantifiers first tries to repeat the token as many times
as possible, and gradually gives up matches as the engine backtracks to find
an overall match.
Laziness: Lazy quantifier first repeats the token as few times as required, and
gradually expands the match as the engine backtracks through the regex to
find an overall match.
From Regular expression
The standard quantifiers in regular
expressions are greedy, meaning they
match as much as they can, only giving
back as necessary to match the
remainder of the regex.
By using a lazy quantifier, the
expression tries the minimal match
first.
Greedy matching. The default behavior of regular expressions is to be greedy. That means it tries to extract as much as possible until it conforms to a pattern even when a smaller part would have been syntactically sufficient.
Example:
import re
text = "<body>Regex Greedy Matching Example </body>"
re.findall('<.*>', text)
#> ['<body>Regex Greedy Matching Example </body>']
Instead of matching till the first occurrence of ‘>’, it extracted the whole string. This is the default greedy or ‘take it all’ behavior of regex.
Lazy matching, on the other hand, ‘takes as little as possible’. This can be effected by adding a ? at the end of the pattern.
Example:
re.findall('<.*?>', text)
#> ['<body>', '</body>']
If you want only the first match to be retrieved, use the search method instead.
re.search('<.*?>', text).group()
#> '<body>'
Source: Python Regex Examples
Greedy Quantifiers are like the IRS
They’ll take as much as they can. e.g. matches with this regex: .*
$50,000
Bye-bye bank balance.
See here for an example: Greedy-example
Non-greedy quantifiers - they take as little as they can
Ask for a tax refund: the IRS sudden becomes non-greedy - and return as little as possible: i.e. they use this quantifier:
(.{2,5}?)([0-9]*) against this input: $50,000
The first group is non-needy and only matches $5 – so I get a $5 refund against the $50,000 input.
See here: Non-greedy-example.
Why do we need greedy vs non-greedy?
It becomes important if you are trying to match certain parts of an expression. Sometimes you don't want to match everything - as little as possible. Sometimes you want to match as much as possible. Nothing more to it.
You can play around with the examples in the links posted above.
(Analogy used to help you remember).
Greedy means it will consume your pattern until there are none of them left and it can look no further.
Lazy will stop as soon as it will encounter the first pattern you requested.
One common example that I often encounter is \s*-\s*? of a regex ([0-9]{2}\s*-\s*?[0-9]{7})
The first \s* is classified as greedy because of * and will look as many white spaces as possible after the digits are encountered and then look for a dash character "-". Where as the second \s*? is lazy because of the present of *? which means that it will look the first white space character and stop right there.
Best shown by example. String. 192.168.1.1 and a greedy regex \b.+\b
You might think this would give you the 1st octet but is actually matches against the whole string. Why? Because the.+ is greedy and a greedy match matches every character in 192.168.1.1 until it reaches the end of the string. This is the important bit! Now it starts to backtrack one character at a time until it finds a match for the 3rd token (\b).
If the string a 4GB text file and 192.168.1.1 was at the start you could easily see how this backtracking would cause an issue.
To make a regex non greedy (lazy) put a question mark after your greedy search e.g
*?
??
+?
What happens now is token 2 (+?) finds a match, regex moves along a character and then tries the next token (\b) rather than token 2 (+?). So it creeps along gingerly.
To give extra clarification on Laziness, here is one example which is maybe not intuitive on first look but explains idea of "gradually expands the match" from Suganthan Madhavan Pillai answer.
input -> some.email#domain.com#
regex -> ^.*?#$
Regex for this input will have a match. At first glance somebody could say LAZY match(".*?#") will stop at first # after which it will check that input string ends("$"). Following this logic someone would conclude there is no match because input string doesn't end after first #.
But as you can see this is not the case, regex will go forward even though we are using non-greedy(lazy mode) search until it hits second # and have a MINIMAL match.
try to understand the following behavior:
var input = "0014.2";
Regex r1 = new Regex("\\d+.{0,1}\\d+");
Regex r2 = new Regex("\\d*.{0,1}\\d*");
Console.WriteLine(r1.Match(input).Value); // "0014.2"
Console.WriteLine(r2.Match(input).Value); // "0014.2"
input = " 0014.2";
Console.WriteLine(r1.Match(input).Value); // "0014.2"
Console.WriteLine(r2.Match(input).Value); // " 0014"
input = " 0014.2";
Console.WriteLine(r1.Match(input).Value); // "0014.2"
Console.WriteLine(r2.Match(input).Value); // ""

How to check how many variables (masks) declared in Regexp (ruby)?

Let's say I have a regexp with some arbitrary amount of capturing groups:
pattern = /(some)| ..a lot of masks combined.... |(other)/
Is there any way to determine a number of that groups?
If you can always find a string that matches the regex you are given, then it suffices to match it against the regex, and look at the match data length. However, determining whether a regexp has a string that it matches is np-hard[1]. This is only feasible if you know in advance what kind of regexes you'll be getting.
The next best best method in the Regexp class is Regexp#source or Regexp#to_s. However, we need to parse the regex if we do this.
I can't speak for the future, but as of Ruby 2.0, there is no better method in the Regexp core class.
A left parenthesis denotes a literal left parenthesis, if preceded by an unescaped backslash. A backslash is unescaped unless an unescaped backslash precedes. So, a character is escaped iff preceded by an odd number of backslashes.
An unescaped left parenthesis denotes a capturing group iff not followed by a question mark. With a question mark, it can mean various things: (?'name') and (?<name>) denote a named capturing group. Named and unnamed capturing groups cannot coexist in the same regex, however[2]. (?:) denote non-capturing groups. This is a special case of (?flags-flags:). (?>) denote atomic groups. (?=), (?!), (?<=) and (?<!) denote lookaround. (?#) denote comments.
Ruby regexp engine supports comments in regexes. Considering them in the main regex would be very difficult. We can try to strip them if we really want to support these, but supporting them fully will get messy due to the possibility of inline flags turning extended mode (and thus line comments) on and off in ways that a regular expression cannot capture. I will go ahead and not support unescaped parentheses in regex comments[3].
We want to count:
the number of left parentheses \(
that are not escaped by a backslash (?<!(?<!\\)(?:\\\\)*\\) (read: not preceded by an odd number of backslashes that are not preceded by yet another backslash) and
that are not followed by a question mark (?!\?)
Ruby doesn't support unbounded lookbehind, but if we reverse the source first, we can rewrite the first assertion slightly: (?!(?:\\\\)*(?!\\)). The second assertion becomes a lookbehind: (?<!\?).
the whole solution
def count_groups(regexp)
# named capture support:
# named_count = regexp.named_captures.count
# return named_count if named_count > 0
# main:
test = /(?!<\?)\((?!(?:\\\\)*(?!\\))/
regexp.source.scan(test).count
end
[1]: we can show the NP-hardness by converting the satisfiability problem to it:
AND: xy (x must be an assertion)
OR: x|y
NOT: (?!x)
atoms: (?=1), (?=.1), (?=..1), ..., (?!1), (?!.1)...
example(XOR): /^(?:(?=1)(?!.1)|(?!1)(?=.1))..$/
this extends to NP-completeness for any class of regexes that can be tested in polynomial time. This includes any regex with no nested repetition (or repeated backreferences to repetition or recursion) and with bounded nesting depth of optional matches.
[2]: /((?<name>..)..)../.match('abcdef').to_a returns ['abcdef', 'ab'], indicating that unnamed capturing groups are ignored when named capturing groups are present. Tested in Ruby 1.9.3
[3]: Inline comments start with (?# and end with ). They cannot contain an unescaped right parenthesis, but they can contain an unescaped left parenthesis. These can be stripped easily (even though we have to sprinkle the "unescaped" regex everywhere), are the lesser evil, but they're also less likely to contain anunescaped left parenthesis.
Line comments start with # and end with a newline. These are only treated as comments in the extended mode. Outside the extended mode, they match the literal # and newline. This is still easy, even if we have to consider escaping again. Determining if the regex has the extended flag set is not too difficult, but the flag modifier groups are a different beast entirely.
Even with Ruby's awesome recursive regexes, merely determining if a previously-open group modifying the extended mode is already closed would yield a very nasty regex (even if you replace one by one and don't have to skip comments, you have to account for escaping). It wouldn't be pretty (even with interpolation) and it wouldn't be fast.

What does (?m:\s*) mean in Regex jargon?

What would this mean in an expression?
(?m:.*?)
or this
(?m:\s*)
I mean, it appears to be something to do with whitespace but I'm unsure.
ADDITIONAL DETAILS:
The full expression I'm looking at is:
\A((?m:\s*)((\/\*(?m:.*?)\*\/)|(\#\#\# (?m:.*?)\#\#\#)|(\/\/ .* \n?)+|(\# .* \n?)+))+
(?...) is a way of applying modifiers to the regular expression inside the parentheses.
(?:...) allows you to treat the part between the parentheses as a group, without affecting the set of strings captured by the matching engine. But you can add option letters between the ? and the :, in which case the part of the regular expression between the parentheses behaves as if you had included those option letters when creating the regular expression. That is, /(?m:...)/ behaves the same as /.../m.
The m, in turn, enables "multiline" mode.
CORRECTED:
Here's where I got confused in the original answer, because this option has different meanings in different environments.
This question is tagged Ruby, in which "multiline mode" causes the dot character (.) to match newlines, whereas normally that's the one character it doesn't match:
irb(main):001:0> "a\nb" =~ /a.b/
=> nil
irb(main):002:0> "a\nb" =~ /a.b/m
=> 0
irb(main):003:0> "a\nb" =~ /(?m:a.b)/
=> 0
So your first regular expression, (?m:.*?) will match any number (including zero) of any characters (including newlines). Basically, it will match anything at all, including nothing.
In the second regular expression, (?m:\s*), the modifier has no effect at all because there are no dots in the contained expression to modify.
Back to the first expression. As Ωmega says, the ? after the * means that it is a non-greedy match. If that were the whole expression, or if there were no captures, it wouldn't matter. But when something follows that section and there are captures, you get different results. Without the ?, the longest possible match wins:
irb(main):001:0> /<(.*)>/.match("<a><b>")[1]
=> "a><b"
With the ?, you get the shortest one instead:
irb(main):002:0> /<(.*?)>/.match("<a><b>")[1]
=> "a"
Finally, about the above-mentioned /m confusion (though if you want to avoid becoming confused yourself, this might be a good place to stop reading):
In Perl 5 (which is the source of most regular expression extensions beyond the basic syntax), the behavior triggered by /m in Ruby is instead triggered by the /s option (which Ruby doesn't have, though if you put one on your regex it will silently ignore it). In Perl, /m, despite still being called "multiline mode", has a completely different effect: it causes the ^ and $ anchors to match at newlines within the string as well as at the beginning and end of the whole string respectively. But in Ruby, that behavior is the default, and there's not even an option to change it.
Pattern .*? will match any string, but as short string as possible, as there is a lazy operator ?.
Pattern \s* will match white-space characters (zero of more).
(?m) enables "multi-line mode". In this mode, the caret and dollar match before and after newlines in the subject string. To apply this mode to some sub-pattern only, sytax (?m:...) is used, where ... is a matching pattern.
For more information read http://www.regular-expressions.info/modifiers.html

How to conflate consecutive gsubs in ruby

I have the following
address.gsub(/^\d*/, "").gsub(/\d*-?\d*$/, "").gsub(/\# ?\d*/,"")
Can this be done in one gsub? I would like to pass a list of patterns rather then just one pattern - they are all being replaced by the same thing.
You could combine them with an alternation operator (|):
address = '6 66-666 #99 11-23'
address.gsub(/^\d*|\d*-?\d*$|\# ?\d*/, "")
# " 66-666 "
address = 'pancakes 6 66-666 # pancakes #99 11-23'
address.gsub(/^\d*|\d*-?\d*$|\# ?\d*/,"")
# "pancakes 6 66-666 pancakes "
You might want to add little more whitespace cleanup. And you might want to switch to one of:
/\A\d*|\d*-?\d*\z|\# ?\d*/
/\A\d*|\d*-?\d*\Z|\# ?\d*/
depending on what your data really looks like and how you need to handle newlines.
Combining the regexes is a good idea--and relatively simple--but I'd like to recommend some additional changes. To wit:
address.gsub(/^\d+|\d+(?:-\d+)?$|\# *\d+/, "")
Of your original regexes, ^\d* and \d*-?\d*$ will always match, because they don't have to consume any characters. So you're guaranteed to perform two replacements on every line, even if that's just replacing empty strings with empty strings. Of my regexes, ^\d+ doesn't bother to match unless there's at least one digit at the beginning of the line, and \d+(?:-\d+)?$ matches what looks like an integer-or-range expression at the end of the line.
Your third regex, \# ?\d*, will match any # character, and if the # is followed by a space and some digits, it'll take those as well. Judging by your other regexes and my experience with other questions, I suspect you meant to match a # only if it's followed by one or more digits, with optional spaces intervening. That's what my third regex does.
If any of my guesses are wrong, please describe what you were trying to do, and I'll do my best to come up with the right regex. But I really don't think those first two regexes, at least, are what you want.
EDIT (in answer to the comment): When working with regexes, you should always be aware of the distinction between a regex the matches nothing and a regex that doesn't match. You say you're applying the regexes to street addresses. If an address doesn't happen to start with a house number, ^\d* will match nothing--that is, it will report a successful match, said match consisting of the empty string preceding the first character in the address.
That doesn't matter to you, you're just replacing it with another empty string anyway. But why bother doing the replacement at all? If you change the regex to ^\d+, it will report a failed match and no replacement will be performed. The result is the same either way, but the "matches noting" scenario (^\d*) results in a lot of extra work that the "doesn't match" scenario avoids. In a high-throughput situation, that could be a life-saver.
The other two regexes bring additional complications: \d*-?\d*$ could match a hyphen at the end of the string (e.g. "123-", or even "-"); and \# ?\d* could match a hash symbol anywhere in string, not just as part of an apartment/office number. You know your data, so you probably know neither of those problems will ever arise; I'm just making sure you're aware of them. My regex \d+(?:-\d+)?$ deals with the trailing-hyphen issue, and \# *\d+ at least makes sure there are digits after the hash symbol.
I think that if you combine them together in a single gsub() regex, as an alternation,
it changes the context of the starting search position.
Example, each of these lines start at the beginning of the result of the previous
regex substitution.
s/^\d*//g
s/\d*-?\d*$//g
s/\# ?\d*//g
and this
s/^\d*|\d*-?\d*$|\# ?\d*//g
resumes search/replace where the last match left off and could potentially produce a different overall output, especially since a lot of the subexpressions search for similar
if not the same characters, distinguished only by line anchors.
I think your regex's are unique enough in this case, and of course changing the order
changes the result.

Resources