Confusion with Atomic Grouping - how does it differ from grouping in Ruby regular expressions?

I have gone through the docs for Atomic Grouping and rubyinfo and some questions came into my mind:
Why the name "Atomic grouping"? What "atomicity" does it have that general grouping doesn't?
How does atomic grouping differ from general grouping?
Why are atomic groups called non-capturing groups?
I tried the code below to understand it, but I am confused about the output and about why the two patterns behave so differently on similar strings.
irb(main):001:0> /a(?>bc|b)c/ =~ "abbcdabcc"
=> 5
irb(main):004:0> $~
=> #<MatchData "abcc">
irb(main):005:0> /a(bc|b)c/ =~ "abcdabcc"
=> 0
irb(main):006:0> $~
=> #<MatchData "abc" 1:"b">

Every () construct has its own properties (this includes (?!pattern), (?=pattern), and so on, as well as the plain (pattern)), but the property common to all of them is grouping: the arbitrary pattern inside becomes a single unit (unit is my own terminology), which is useful for repetition.
The normal capturing group (pattern) both captures and groups. Capturing means that the text matching the pattern inside is recorded, so you can use it with a back-reference, in matching or in replacement. The non-capturing group (?:pattern) doesn't have the capturing property, so it saves a bit of space and time compared to (pattern), since it doesn't store the start and end index of the text matched by the pattern inside.
Atomic grouping (?>pattern) is also non-capturing, so the position of the text matched inside will not be recorded.
Atomic grouping adds atomicity on top of what a capturing or non-capturing group does. Atomic here means: at the current position, find the first sequence (first is defined by how the engine matches according to the pattern given) that matches the pattern inside the atomic group and hold on to it (so backtracking into the group is disallowed).
A group without atomicity allows backtracking: it still finds the first sequence, but if the matching ahead fails, it backtracks and finds the next sequence, until a match for the entire regex is found or all possibilities are exhausted.
Example
Input string: bbabbbabbbbc
Pattern: /(?>.*)c/
The first match by .* is bbabbbabbbbc, due to the greedy quantifier *. The atomic group holds on to this match, which leaves nothing for c to match. The matcher then retries at each subsequent position up to the end of the string, and the same thing happens every time. So nothing matches the regex at all.
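To see this on the same input, here is a small sketch that compares the atomic pattern with an otherwise identical non-atomic one. The variable names are just for illustration.

```ruby
input = "bbabbbabbbbc"

# Plain (non-capturing) group: .* first takes the whole string, then
# gives back one character so that the trailing c can match.
plain = /(?:.*)c/.match(input)

# Atomic group: .* takes the whole string and is not allowed to give
# anything back, so the trailing c never finds a character to match.
atomic = /(?>.*)c/.match(input)

puts plain[0]       # prints bbabbbabbbbc
puts atomic.inspect # prints nil
```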
Input string: bbabbbabbbbc
Pattern: /((?>.*)|b*)[ac]/, for testing /(((?>.*))|(b*))[ac]/
There are 3 matches for this regex: bba, bbba, and bbbbc. If you use the second regex, which is the same but with capturing groups added for debugging purposes, you can see that all the matches are the result of matching b* inside.
You can see the backtracking behavior here.
Without the atomic grouping, /(.*|b*)[ac]/, the string has a single match, which is the whole string, due to backtracking at the end to match [ac]. Note that the engine goes back to .* and backtracks by 1 character, since .* still has other possibilities.
Pattern: /(.*|b*)[ac]/
bbabbbabbbbc
^               -- Start matching. Look at first item in alternation: .*
bbabbbabbbbc
^^^^^^^^^^^^    -- First match of .*, due to greedy quantifier
bbabbbabbbbc
            X   -- [ac] cannot match
                -- Backtrack to ()
bbabbbabbbbc
^^^^^^^^^^^     -- Continue exploring other possibilities with .*
                -- Step back 1 character
bbabbbabbbbc
           ^    -- [ac] matches, end of regex, a match is found
With the atomic grouping, all other possibilities of .* are cut off and it is limited to the first match. So after greedily eating the whole string and failing to match, the engine has to go for the b* pattern, where it successfully finds a match for the regex.
Pattern: /((?>.*)|b*)[ac]/
bbabbbabbbbc
^               -- Start matching. Look at first item in alternation: (?>.*)
bbabbbabbbbc
^^^^^^^^^^^^    -- First match of .*, due to greedy quantifier
                -- The atomic grouping will disallow .* to be backtracked and rematched
bbabbbabbbbc
            X   -- [ac] cannot match
                -- Backtrack to ()
                -- (?>.*) is atomic, check the next possibility by alternation: b*
bbabbbabbbbc
^               -- Starting to rematch with b*
bbabbbabbbbc
^^              -- First match with b*, due to greedy quantifier
bbabbbabbbbc
  ^             -- [ac] matches, end of regex, a match is found
The subsequent matches will continue on from here.
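A quick way to check the match lists above is String#scan. This is only a sketch: the outer group is written as non-capturing (?:...) here (my change, not part of the patterns above) so that scan returns whole matches rather than capture-group contents.

```ruby
input = "bbabbbabbbbc"

# Atomic .* cannot give characters back, so each match comes from b*.
with_atomic = input.scan(/(?:(?>.*)|b*)[ac]/)

# Ordinary .* backtracks by one character and matches the whole string.
without_atomic = input.scan(/(?:.*|b*)[ac]/)

p with_atomic    # => ["bba", "bbba", "bbbbc"]
p without_atomic # => ["bbabbbabbbbc"]
```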

I recently had to explain Atomic Groups to someone else and I thought I'd tweak and share the example here.
Consider /the (big|small|biggest) (cat|dog|bird)/
All four lines match:
the big dog
the small bird
the biggest dog
the small cat
For the first line, a regex engine would find "the ".
It would then proceed to our adjectives (big, small, biggest), and it finds big.
Having matched big, it proceeds and finds the space.
It then looks at our pets (cat, dog, bird), tries cat, skips it, and finds dog.
For the second line, our regex would find "the ".
It would proceed and look at big, skip it, then look at and find small.
It finds the space, skips cat and dog because they don't match, and finds bird.
For the third line, our regex would find "the ".
It continues on and finds big, which matches the immediate requirement, and proceeds.
It can't find the space, so it backtracks (rewinds the position to the last choice it made).
It skips big, skips small, and finds biggest, which also matches the immediate requirement.
It then finds the space.
It skips cat and matches dog.
For the fourth line, our regex would find "the ".
It would proceed to look at big, skip it, then look at and find small.
It then finds the space.
It looks at and matches cat.
Consider /the (?>big|small|biggest) (cat|dog|bird)/
Note the ?> atomic group on adjectives.
Every line except the third matches:
the big dog (match)
the small bird (match)
the biggest dog (no match)
the small cat (match)
For the first line, second line, and fourth line, we'll get the same result.
For the third line, our regex would find "the ".
It continues on and finds big, which matches the immediate requirement, and proceeds.
It can't find the space, but the atomic group, being the last choice the engine made, won't allow that choice to be re-examined (it prohibits backtracking).
Since the engine can't make a new choice and our simple expression has no other alternatives, the match has to fail.
This is only a basic summary. An engine wouldn't need to look at the entirety of cat to know that it doesn't match dog, merely looking at the c is enough. When trying to match bird, the c in cat and the d in dog are enough to tell the engine to examine other options.
However if you had ...((cat|snake)|dog|bird), the engine would also, of course, need to examine snake before it dropped to the previous group and examined dog and bird.
There are also plenty of choices an engine can't decide without going past what may not seem like a match, which is what results in backtracking. If you have ((red)?cat|dog|bird), the engine will look at r, back out, notice the ? quantifier, ignore the subgroup (red), and look for a match.
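The walkthrough above can be checked with a short Ruby sketch. The third pattern, with biggest listed first, is my own addition to show that the order of alternatives matters once the group is atomic.

```ruby
lines = ["the big dog", "the small bird", "the biggest dog", "the small cat"]

plain     = /the (big|small|biggest) (cat|dog|bird)/
atomic    = /the (?>big|small|biggest) (cat|dog|bird)/
# Hypothetical variant: the longest alternative comes first, so the atomic
# group commits to "biggest" before it ever tries "big".
reordered = /the (?>biggest|big|small) (cat|dog|bird)/

lines.each do |line|
  puts format("%-16s plain: %-5s atomic: %-5s reordered: %s",
              line, plain.match?(line), atomic.match?(line), reordered.match?(line))
end

plain_hits     = lines.select { |line| plain.match?(line) }
atomic_hits    = lines.select { |line| atomic.match?(line) }
reordered_hits = lines.select { |line| reordered.match?(line) }
```

Only "the biggest dog" is rejected by the atomic version, because the group commits to big and cannot go back to try biggest.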

An "atomic group" is a group that the regular expression engine will never backtrack into once it has matched. So in your first example /a(?>bc|b)c/, if the bc alternative in the group matches, the engine will never backtrack out of that and try the b alternative. If you slightly alter your first example to match against "abcdabcc", you'll see it still matches the "abcc" at the end of the string instead of the "abc" at the start. If you don't use an atomic group, the engine can backtrack past the bc, try the b alternative, and end up matching the "abc" at the start.
As for question two, how it's different, that's just a rephrasing of your first question.
And lastly, atomic groups are not "called" non-capturing groups. That's not an alternate name for them. Non-capturing groups are groups that do not capture their content. Typically when you match a regular expression against a string, you can retrieve all the matched groups, and if you use a substitution, you can use backreferences in the substitution like \1 to insert the captured groups there. But a non-capturing group does not provide this. The classic non-capturing group is (?:pattern). An atomic group happens to also have the non-capturing property, hence why it's called a non-capturing group.
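Here is a small sketch of both points, the lack of backtracking and the lack of capturing, using the two patterns from the question against "abcdabcc". The variable names are mine.

```ruby
atomic = /a(?>bc|b)c/
plain  = /a(bc|b)c/

m_atomic = atomic.match("abcdabcc")
m_plain  = plain.match("abcdabcc")

# The atomic group commits to "bc" at index 1, so the attempt at index 0
# fails and the match is found later in the string.
p m_atomic.begin(0) # => 4
p m_atomic[0]       # => "abcc"

# The capturing group can backtrack from "bc" to "b" and match at index 0.
p m_plain.begin(0)  # => 0
p m_plain[0]        # => "abc"

# Only the plain group records what it matched.
p m_atomic.captures # => []
p m_plain.captures  # => ["b"]
```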

Related

Inconsistency between engines when using reluctant quantifier in negative look ahead

I found something odd when using a reluctant quantifier in a negative look ahead.
When creating a regex to assert a maximum of 3 uppercase characters, I devised this:
^(?!(.*?[A-Z]){4}).*$
which works on rubular, but not on regex101.
Why is that?
^ and $ match the beginning and end of each line in Ruby.
In other languages, ^ and $ match the beginning and end of the whole string unless multiline mode (m) is specified. (Some regular expression engines also require the g flag to match multiple times.)
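A minimal sketch of the difference in Ruby, assuming a two-line test input of my own choosing. \A and \z are Ruby's whole-string anchors.

```ruby
per_line     = /^(?!(.*?[A-Z]){4}).*$/
whole_string = /\A(?!(.*?[A-Z]){4}).*\z/

text = "ABCD\nabc"

# ^ and $ work per line in Ruby: the first line has four uppercase letters
# and is rejected, but the second line (starting at index 5) is accepted.
per_line_index = per_line =~ text
p per_line_index # => 5

# \A and \z anchor to the whole string, so nothing matches here.
whole_index = whole_string =~ text
p whole_index # => nil

# A single-line string with only three uppercase letters is accepted.
single_index = whole_string =~ "ABCd"
p single_index # => 0
```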

Can someone give me an example of regular expressions using {x} and {x,y}?

I just learned from a book about regular expressions in the Ruby language. I did Google it, but still got confused about {x} and {x,y}.
The book says:
{x}→Match x occurrences of the preceding character.
{x,y}→Match at least x occurrences and at most y occurrences.
Can anyone explain this better, or provide some examples?
Sure, look at these examples:
http://rubular.com/r/sARHv0vf72
http://rubular.com/r/730Zo6rIls
/a{4}/
is the short version for:
/aaaa/
It says: match exactly 4 consecutive 'a' characters.
where
/a{2,4}/
says: Match at least 2, and at most 4 characters of 'a'.
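it will match the strings
aa
aaa
aaaa
and it won't match
a
xxx
Without anchors it also finds a match inside aaaaa (the first four characters). To reject that string, anchor the pattern, for example /\Aa{2,4}\z/.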
The "Limiting Repetition" tutorial is a good online resource for this.
I highly recommend regexbuddy.com and very briefly, the regex below does what you refer to:
[0-9]{3}|\w{3}
The [ ] characters define a character class: [0-9] matches any single digit between 0 and 9. The { } with a 3 inside means match exactly three of those digits in a row. The | is an or (alternation). The \w is shorthand for any word character, and once again the {3} matches only runs of exactly three.
If you go to RegexPal.com you can enter the code above and test it. I used the following data to test the expression:
909 steve kinzey
and the expression matched the 909, the 'ste', the 'kin' and the 'zey'. It did not match the 've' because it is only 2 word characters long and a word character does not span white space so it could not carry over to the second word.
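The same test can be run directly in Ruby with String#scan, as a quick sketch:

```ruby
text = "909 steve kinzey"

# Each match is either three digits or three word characters.
matches = text.scan(/[0-9]{3}|\w{3}/)

p matches # => ["909", "ste", "kin", "zey"]
```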
Interval Expressions
GNU awk refers to these as "interval expressions" in the Regexp Operators section of its manual. It explains the expressions as follows:
{n}
{n,}
{n,m}
One or two numbers inside braces denote an interval expression. If there is one number in the braces, the preceding regexp is repeated n times. If there are two numbers separated by a comma, the preceding regexp is repeated n to m times. If there is one number followed by a comma, then the preceding regexp is repeated at least n times:
The manual also includes these reference examples:
wh{3}y
Matches ‘whhhy’, but not ‘why’ or ‘whhhhy’.
wh{3,5}y
Matches ‘whhhy’, ‘whhhhy’, or ‘whhhhhy’, only.
wh{2,}y
Matches ‘whhy’ or ‘whhhy’, and so on.
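Ruby accepts the same interval syntax, so the awk manual's examples can be tried as-is. The sample list and variable names below are my own.

```ruby
exactly_three = /wh{3}y/
three_to_five = /wh{3,5}y/
two_or_more   = /wh{2,}y/

samples = %w[why whhy whhhy whhhhy whhhhhy whhhhhhy]

exact_hits = samples.select { |s| exactly_three.match?(s) }
range_hits = samples.select { |s| three_to_five.match?(s) }
open_hits  = samples.select { |s| two_or_more.match?(s) }

p exact_hits # => ["whhhy"]
p range_hits # => ["whhhy", "whhhhy", "whhhhhy"]
p open_hits  # => ["whhy", "whhhy", "whhhhy", "whhhhhy", "whhhhhhy"]

# An interval only limits one repetition; without anchors, /a{2,4}/
# still finds four a's inside a longer run.
unanchored = "aaaaa"[/a{2,4}/]
anchored   = /\Aa{2,4}\z/.match?("aaaaa")
p unanchored # => "aaaa"
p anchored   # => false
```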
See Also
Ruby's Regexp class.
Quantifiers section of Ruby's oniguruma engine.

matching single letters in a sentence with a regular expression

I want to match single letters in a sentence. So in ...
I want to have my turkey. May I. I 20,000-t bar-b-q
I'd like to match
*I* want to have my turkey. May *I*. *I* 20,000-t bar-b-q
right now I'm using
/\b\w\b/
as my regular expression, but that is matching
*I* want to have my turkey. May *I*. *I* 20,000-*t* bar-*b*-*q*
Any suggestions on how to get past that last mile?
Use a negative lookbehind and a negative lookahead to fail if the previous character is a word character or a hyphen, or if the next character is a word character or a hyphen:
/(?<![\w\-])\w(?![\w\-])/
Example: http://www.rubular.com/r/9upmgfG9u4
Note that as mentioned by rtcherry, this will also match single numbers. To prevent this you may want to change the \w that is outside of the character classes to [a-zA-Z].
F.J's answer will also include numbers. The pattern below is restricted to ASCII characters, but you really need to define which characters can sit side by side and still count as a single letter.
/(?<![0-9a-zA-Z\-])[a-zA-Z](?![0-9a-zA-Z\-])/
That will also avoid things like This -> 1a <- is not a single letter. Neither is -> 2 <- that.
As long as we're being picky, non-ASCII letters are easy to include:
/(?<![[:alnum:]-])[[:alpha:]](?![[:alnum:]-])/
This will avoid matching the t in 'Cómo eres tú'
Notice that it's not necessary to escape the - when it is the last character in a character class (which I'm not sure that this technically is).
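To compare the ASCII-only pattern with the POSIX bracket version, here is a sketch. The Spanish string is written with \u escapes only to keep the source ASCII; it is the same 'Cómo eres tú' as above.

```ruby
ascii_only = /(?<![0-9a-zA-Z\-])[a-zA-Z](?![0-9a-zA-Z\-])/
posix      = /(?<![[:alnum:]-])[[:alpha:]](?![[:alnum:]-])/

spanish  = "C\u00F3mo eres t\u00FA" # "Cómo eres tú"
sentence = "I want to have my turkey. May I. I 20,000-t bar-b-q"

# The ASCII classes do not know that the accented letters are letters, so
# the C and the t look like they stand alone.
ascii_spanish = spanish.scan(ascii_only)
p ascii_spanish # => ["C", "t"]

# The POSIX brackets treat the accented letters as letters.
posix_spanish = spanish.scan(posix)
p posix_spanish # => []

posix_sentence = sentence.scan(posix)
p posix_sentence # => ["I", "I", "I"]
```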
You are asking far too much of a regular expression. \w matches a word character, which includes upper and lower case alphabetics, the ten digits, and underscore. So it is the same as [0-9A-Z_a-z].
\b matches the (zero-width) boundary where a word character doesn't have another word character next to it, for instance at the beginning or end of a string, or next to some punctuation or white space.
Using negative look-behinds and look-aheads, this amounts to \b\w\b being equivalent to
(?<!\w)\w(?!\w)
i.e. a word character that doesn't have another word character before or after it.
As you have found, that finds t, b and q in 20,000-t bar-b-q. So it's back in your court to define what you really mean by "single letters in a sentence".
It nearly works to say "any letter that isn't preceded or followed by a printable (non-whitespace) character", which is
/(?<!\S)[A-Za-z](?!\S)/
But that leaves out I in May I. because it has a dot after it.
So, do you mean a single letter that isn't preceded by a printable character, and is followed by whitespace, a dot, or the end of the string (or a comma, a semicolon or a colon for good measure)? Then you want
/(?<!\S)[A-Za-z](?=(?:[\s.,;:]|\z))/
which finds exactly three I characters in your string.
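For comparison, here is a sketch that runs the three patterns from this answer over the sentence from the question. The variable names are mine.

```ruby
sentence = "I want to have my turkey. May I. I 20,000-t bar-b-q"

word_boundary = /\b\w\b/
whitespace    = /(?<!\S)[A-Za-z](?!\S)/
with_punct    = /(?<!\S)[A-Za-z](?=(?:[\s.,;:]|\z))/

boundary_hits   = sentence.scan(word_boundary)
whitespace_hits = sentence.scan(whitespace)
punct_hits      = sentence.scan(with_punct)

p boundary_hits   # => ["I", "I", "I", "t", "b", "q"]
p whitespace_hits # => ["I", "I"]
p punct_hits      # => ["I", "I", "I"]
```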
I hope that helps.

How does backtracking differ from back-referencing in regular expressions?

How does backtracking differ from back-referencing in regular expressions?
Does back-referencing overcome a limitation of backtracking, or vice versa?
Backtracking is a way for a state machine to back up and retry other matches for a regular expression. It's something that's pretty much internal to the regex engine.
For example, say you're trying to match the regex [a-z]*a, any number of lower case characters followed by an a.
Given the input abca, a greedy match will assign all of that to the [a-z]* portion of the regex, but then there's nothing left to match the final a. Backtracking allows the engine to back up by returning that final a to the input stream and trying again, assigning abc to the [a-z]* portion and a to the a portion.
Back-referencing on the other hand, is a means for a user of the regex engine to reference previously captured groups. For example,
s/^([a-z])([a-z])/\1_\2/
   \_____/\_____/
      |      |
      |      +- capture group 2
      +-------- capture group 1
may be a command to insert _ between two consecutive lower case letters at the start of each line. The \N back-reference (where N represents a number) refers back to the groups captured within ().
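In Ruby terms, the two ideas look like this. It is a small sketch: the capture group around [a-z]* is added only so we can see what the engine ended up assigning to it, and the input strings are my own.

```ruby
# Backtracking: [a-z]* first grabs "abca", then gives back the last "a"
# so the literal a at the end of the pattern can match.
m = /([a-z]*)a/.match("abca")
p m[1] # => "abc"
p m[0] # => "abca"

# Back-reference in a replacement: \1 and \2 refer to the captured groups.
replaced = "ab cd".sub(/^([a-z])([a-z])/, '\1_\2')
p replaced # => "a_b cd"

# Back-reference inside the pattern itself: match a doubled letter.
doubled = "hello"[/([a-z])\1/]
p doubled # => "ll"
```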

Treetop backtracking similar to regex?

Everything I've read suggests Treetop backtracks like regular expressions, but I'm having a hard time making that work.
Suppose I have the following grammar:
grammar TestGrammar
  rule open_close
    '{' .+ '}'
  end
end
This does not match the string {abc}. I suspect that's because the .+ is consuming everything from the letter a onwards. I.e. it's consuming abc} when I only want it to consume abc.
This appears different from what a similar regex does. The regex /{.+}/ will match {abc}. It's my understanding that this is possible because the regex engine backtracks after consuming the closing } as part of the .+ and then failing to match.
So can Treetop do backtracking like that? If so, how?
I know you can use negation to match "anything other than a }." But that's not my intention. Suppose I want to be able to match the string {ab}c}. The tokens I want in that case are the opening {, a middle string of ab}c, and the closing }. This is a contrived example, but it becomes very relevant when working with nested expressions like {a b {c d}}.
Treetop is an implementation of a Parsing Expression Grammar parser. One of the benefits of PEGs is their combination of flexibility, speed, and memory requirements. However, this balancing act has some tradeoffs.
Quoting from the Wikipedia article:
The zero-or-more, one-or-more, and optional operators consume zero or more, one or more, or zero or one consecutive repetitions of their sub-expression e, respectively. Unlike in context-free grammars and regular expressions, however, these operators always behave greedily, consuming as much input as possible and never backtracking. […] the expression (a* a) will always fail because the first part (a*) will never leave any a's for the second part to match.
(Emphasis mine.)
In short: while certain PEG operators can backtrack in an attempt to take another route, the + operator cannot.
Instead, in order to match nested sub-expressions, you want to create an alternation between the delimited sub-expression (checked first) followed by the non-expression characters. Something like (untested):
grammar TestGrammar
  rule open_close
    '{' contents '}'
  end
  rule contents
    open_close / non_brackets
  end
  rule non_brackets
    # …
  end
end
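As an analogy only (this is plain Ruby regex, not Treetop), the never-give-back behaviour of PEG repetition can be imitated with an atomic group around .+, which makes the difference from ordinary regex backtracking easy to see:

```ruby
backtracking = /\{.+\}/
no_give_back = /\{(?>.+)\}/ # behaves like the PEG '{' .+ '}' rule

# Ordinary regex: .+ consumes "abc}" and then gives the "}" back.
p backtracking.match?("{abc}") # => true

# Atomic .+ keeps the "}" it consumed, so the closing brace never matches.
p no_give_back.match?("{abc}") # => false

# With backtracking, the middle part of "{ab}c}" comes out as wanted.
middle = "{ab}c}"[/\{(.+)\}/, 1]
p middle # => "ab}c"
```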

Resources