Ruby regular expression utilizing OR within a character class

While going through the ruby-doc for regular expressions, I came across this example for implementing the && operator:
/[a-w&&[^c-g]z]/ # ([a-w] AND ([^c-g] OR z))
# This is equivalent to:
/[abh-w]/
I understand that
/[a-w&&[^c-g]]/
would equate to
/[abh-w]/
because the "^" denotes symbols that should be excluded from the regular expression.
However, I am wondering about why "z" is not also included? Why was the equivalent regular expression NOT:
/[abh-wz]/
I am very new to regular expressions, much less any specifics for regular expressions within Ruby, so any help is greatly appreciated!

The page explicitly says:
/[a-w&&[^c-g]z]/ # ([a-w] AND ([^c-g] OR z))
# This is equivalent to:
/[abh-w]/
"z" is not included in the left "AND" term, so it can't be matched.
See: "All things that are both apples, and also either apples or pears, at the same time" does not include pears. Only apples are both apples and (apples or pears). Likewise, a is in both a-w and [^c-g]z, so it matches; z is not in the left side, so "AND" is not satisfied, thus the whole expression fails.
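One way to check this is to test every lowercase letter against both patterns. The following is only a small sketch; the variable names are made up for the illustration.

```ruby
# Compare the intersection class with its claimed equivalent,
# one lowercase letter at a time.
intersection = /[a-w&&[^c-g]z]/
equivalent   = /[abh-w]/

letters = ("a".."z").to_a

# Array#grep keeps the elements for which `pattern === element` is true.
matched_by_intersection = letters.grep(intersection)
matched_by_equivalent   = letters.grep(equivalent)

puts matched_by_intersection.join
puts matched_by_intersection == matched_by_equivalent
puts matched_by_intersection.include?("z")
```

Both lists should come out identical, and "z" should be in neither, because it never satisfies the left-hand a-w term.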

Related

What does /anystring/ mean in ruby?

I came across this: /sera/ === coursera. What does /sera/ mean? Please tell me. I do not understand the meaning of the expression above.
It's a regular expression. The more formal version of same is this:
coursera.match(/sera/)
Or:
/sera/.match(coursera)
These are both functionally similar. Either a string matches a regular expression, or a regular expression can be tested for matches against a string.
The long explanation of your original code is: Can the characters sera be found in the variable coursera?
If you do this:
"coursera".match(/sera/)
# => #<MatchData "sera">
You get a MatchData result which means it matched. For more complicated expressions you can capture parts of the string using arbitrary patterns and so on. The general rule here is regular expressions in Ruby look like /.../ or vaguely like %r[...] in form.
You may also see the =~ operator used which is something Ruby inherited from Perl. It also means match.
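To compare the three forms side by side, here is a short sketch. It assumes coursera is a String variable, as the question implies; the other variable names are invented for the example.

```ruby
# `coursera` is assumed to hold a String, as in the question.
coursera = "coursera"

case_equality = /sera/ === coursera      # true or false
match_data    = coursera.match(/sera/)   # MatchData or nil
position      = coursera =~ /sera/       # index of the match, or nil

puts case_equality   # true
puts match_data[0]   # sera
puts position        # 4
```

All three ask the same question; they differ only in what they return when the pattern is found.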

Precedence of Ruby regular expressions?

I am reviewing regular expressions and cannot understand why a regular expression won't match a given string, specifically:
regex = /(ab*)+(bc)?/
mystring = "abbc"
The match matches "abb" but leaves the c off. I tested this using Rubular and in IRB and don't understand why the regex doesn't match the entire string. I thought that (ab*)+ would match "ab" and then (bc)? would match "bc".
Am I missing something in terms of precedence for regular expression operations?
Regular expressions try to match the first part of the regular expression as much as possible by default, and they do not backtrack to try to make larger sections match if they don't have to. Since you make (bc) optional, the (ab*) can match as much as it wants (the non-zero repetition after it doesn't have much to do) and doesn't try backtracking to try other matching alternatives.
If you want the whole string to be matched (which will force some backtracking in this case) make sure you anchor both ends of the string:
regex = /^(ab*)+(bc)?$/
The regex contains two parenthesized groups, which are tried in order.
The first one matches abb because (ab*) means an a followed by zero or more b. You have two b, so the match is abb. Then only c is left in your string, so it doesn't match the second group, which is bc.
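The difference between the two patterns can be seen by inspecting the captured groups. This is a minimal sketch using the string from the question; the names unanchored and anchored are made up here.

```ruby
mystring = "abbc"

unanchored = /(ab*)+(bc)?/.match(mystring)
anchored   = /^(ab*)+(bc)?$/.match(mystring)

# Without anchors, the greedy b* keeps both b's and (bc)? matches nothing.
puts unanchored[0]           # abb
puts unanchored[2].inspect   # nil

# With anchors, b* has to give one b back so that (bc)? and $ can succeed.
puts anchored[0]   # abbc
puts anchored[1]   # ab
puts anchored[2]   # bc
```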

How does this Ruby code work - if-stmt with ranges ?

I'm currently learning Ruby and I can't seem to wrap around what if /start/../end does... Help?
while gets
print if /start/../end/
end
Since you mentioned that you're new to Ruby, it's first worth taking note that you're dealing with Regular Expressions (regex) in the example - anything that is delimited between two forward slashes:
/start/ # a regular expression literal
Regular Expressions are a powerful way of matching a certain combination of letters from a larger string.
"To start means to begin." =~ /start/ #=> 3, the index of the match (truthy), because 'start' is in the string.
The double dot notation is the flip-flop operator, a controversial construct probably inherited from Perl and not usually recommended to be used because it can lead to confusion.
It means the following:
It evaluates to false until the left-hand operand is true, at which point it evaluates to true. It then stays true up to and including the evaluation on which the right-hand operand is true; after that it evaluates to false again.
Using your above example therefore:
while gets
print if /start/../end/
end
Until a line containing 'start' is entered, the entire expression is false, and nothing is printed.
When a line containing 'start' is input, the entire expression becomes true, so that line and EVERYTHING input after it is printed (even lines that do not contain 'start').
When a line containing 'end' is input, that line is still printed, but the expression then flips back to false, and nothing after that point is printed.
It's called the flip-flop operator. You can read more at "Ruby flip-flop operator".
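The behaviour can be tried without reading from standard input. The sketch below is an assumption-laden stand-in for the loop in the question: it walks over a hard-coded array (the lines and printed names are invented) and writes the match explicitly instead of relying on the implicit $_ variable.

```ruby
# Hypothetical input, standing in for the lines read by `gets`.
lines = ["one", "start", "two", "end", "three"]

printed = []
lines.each do |line|
  # In a condition, `a..b` acts as a flip-flop rather than building a Range.
  printed << line if (line =~ /start/)..(line =~ /end/)
end

puts printed.inspect   # ["start", "two", "end"]
```

Note that both the 'start' line and the 'end' line are included in the output. Some Ruby versions print a deprecation warning for flip-flops, but the code still runs.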

What does (?m:\s*) mean in Regex jargon?

What would this mean in an expression?
(?m:.*?)
or this
(?m:\s*)
I mean, it appears to be something to do with whitespace but I'm unsure.
ADDITIONAL DETAILS:
The full expression I'm looking at is:
\A((?m:\s*)((\/\*(?m:.*?)\*\/)|(\#\#\# (?m:.*?)\#\#\#)|(\/\/ .* \n?)+|(\# .* \n?)+))+
(?...) is a way of applying modifiers to the regular expression inside the parentheses.
(?:...) allows you to treat the part between the parentheses as a group, without affecting the set of strings captured by the matching engine. But you can add option letters between the ? and the :, in which case the part of the regular expression between the parentheses behaves as if you had included those option letters when creating the regular expression. That is, /(?m:...)/ behaves the same as /.../m.
The m, in turn, enables "multiline" mode.
CORRECTED:
Here's where I got confused in the original answer, because this option has different meanings in different environments.
This question is tagged Ruby, in which "multiline mode" causes the dot character (.) to match newlines, whereas normally that's the one character it doesn't match:
irb(main):001:0> "a\nb" =~ /a.b/
=> nil
irb(main):002:0> "a\nb" =~ /a.b/m
=> 0
irb(main):003:0> "a\nb" =~ /(?m:a.b)/
=> 0
So your first regular expression, (?m:.*?) will match any number (including zero) of any characters (including newlines). Basically, it will match anything at all, including nothing.
In the second regular expression, (?m:\s*), the modifier has no effect at all because there are no dots in the contained expression to modify.
Back to the first expression. As Ωmega says, the ? after the * means that it is a non-greedy match. If that were the whole expression, or if there were no captures, it wouldn't matter. But when something follows that section and there are captures, you get different results. Without the ?, the longest possible match wins:
irb(main):001:0> /<(.*)>/.match("<a><b>")[1]
=> "a><b"
With the ?, you get the shortest one instead:
irb(main):002:0> /<(.*?)>/.match("<a><b>")[1]
=> "a"
Finally, about the above-mentioned /m confusion (though if you want to avoid becoming confused yourself, this might be a good place to stop reading):
In Perl 5 (which is the source of most regular expression extensions beyond the basic syntax), the behavior triggered by /m in Ruby is instead triggered by the /s option (in Ruby, a trailing s on a regex is accepted, but it selects a character encoding rather than changing what the dot matches). In Perl, /m, despite still being called "multiline mode", has a completely different effect: it causes the ^ and $ anchors to match at newlines within the string as well as at the beginning and end of the whole string respectively. But in Ruby, that behavior is the default, and there's not even an option to change it.
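Both points can be checked directly in Ruby. This is a small sketch with made-up variable names: the first pair shows that ^ and $ already match at line boundaries without any option (with \A and \z as the whole-string anchors), and the second pair shows that (?m:...) only affects dots inside the group.

```ruby
text = "first\nsecond"

line_anchored   = text =~ /^second$/     # ^ and $ match at line boundaries
string_anchored = text =~ /\Asecond\z/   # \A and \z match only at the string ends

scoped_inside  = "a\nb" =~ /(?m:a.)b/    # the dot is inside the (?m:...) group
scoped_outside = "a\nb" =~ /(?m:a).b/    # the dot is outside the group

puts line_anchored.inspect     # 6
puts string_anchored.inspect   # nil
puts scoped_inside.inspect     # 0
puts scoped_outside.inspect    # nil
```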
Pattern .*? will match any string, but the shortest one possible, because of the lazy quantifier ?.
Pattern \s* will match whitespace characters (zero or more).
(?m) enables "multi-line mode". In Perl-style flavors, this mode makes the caret and dollar match before and after newlines in the subject string; in Ruby, it instead makes the dot match newlines. To apply this mode to a sub-pattern only, the syntax (?m:...) is used, where ... is a matching pattern.
For more information read http://www.regular-expressions.info/modifiers.html

How do I make part of a regular expression optional in Ruby?

To match the following:
On Mar 3, 2011 11:05 AM, "mr person"
wrote:
I have the following regular expression:
/(On.* (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{1,2}, [12]\d{3}.* at \d{1,2}:\d{1,2} (?:AM|PM),.*wrote:)/m
Is there a way to make the at optional? so if it's there great, if not, it still matches?
Sure. Put it in parentheses, put a question mark after it. Include one of the spaces (since otherwise you'll be trying to match two spaces if the "at" is missing.) (at )? (or as someone else suggested, (?:at )? to avoid it being captured).
Don't forget (?:) to make sure the bracketed expression doesn't get captured (and keep the trailing space inside the group):
(?:at )?
Sure, you just need to group the optional part...
(at )*
And, ok, I guess that will match at at at at, so you might want to just do:
(at )?
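Putting the suggestion into the full pattern from the question might look like the sketch below. The two sample strings and all variable names are assumptions made for the illustration; only the original regex comes from the question.

```ruby
# Original pattern: " at " is mandatory.
required_at = /(On.* (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{1,2}, [12]\d{3}.* at \d{1,2}:\d{1,2} (?:AM|PM),.*wrote:)/m

# Modified pattern: "at " (with its trailing space) is optional and not captured.
optional_at = /(On.* (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{1,2}, [12]\d{3}.* (?:at )?\d{1,2}:\d{1,2} (?:AM|PM),.*wrote:)/m

# Hypothetical inputs, modelled on the example in the question.
without_at = "On Mar 3, 2011 11:05 AM, \"mr person\"\nwrote:"
with_at    = "On Mar 3, 2011 at 11:05 AM, \"mr person\"\nwrote:"

puts required_at.match?(without_at)   # false
puts required_at.match?(with_at)      # true
puts optional_at.match?(without_at)   # true
puts optional_at.match?(with_at)      # true
```

Keeping the space inside the optional group matters: with (?:at)? followed by a separate space, the version without "at" would need two consecutive spaces to match.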
Others got your answer. This is just an aside re: Regular Expressions.
When you say "conditions" in regular expressions, it refers to the regex language. Like any language, it's a branch in code execution, but the code is a different regular expression path, the "code" of regular expressions.
So in pseudocode: if (evaluation is true) do this regular sub-expression, else do this other sub-expression.
This conditional exists in advanced regular expression engines ... Perl.
Perl uses the most advanced regular expression engine that exists. In version 6 and beyond it will be an integral part of the language, where code and expression intermingle seamlessly.
Perl 5.10 has this construct:
(?(condition)yes-pattern|no-pattern).
Edit: Just a warning that where Perl goes, every other language follows as far as regular expressions are concerned.
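For what it's worth, Ruby's own Regexp documentation describes a conditional of the same (?(cond)yes|no) shape for Ruby 2.0 and later. The sketch below assumes such a Ruby version; the pattern and test strings are invented for the illustration and are not taken from the answer above.

```ruby
# If group 1 (foo) took part in the match, require "T"; otherwise require "F".
conditional = /\A(foo)?(?(1)T|F)\z/

results = %w[fooT F fooF T].map { |s| [s, conditional.match?(s)] }.to_h

puts results.inspect
```

Here "fooT" and "F" should match, while "fooF" and "T" should not, because the branch taken depends on whether the optional group matched.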
