Treetop backtracking similar to regex? - ruby

Everything I've read suggests Treetop backtracks like regular expressions, but I'm having a hard time making that work.
Suppose I have the following grammar:
grammar TestGrammar
rule open_close
'{' .+ '}'
end
end
This does not match the string {abc}. I suspect that's because the .+ is consuming everything from the letter a onwards. I.e. it's consuming abc} when I only want it to consume abc.
This appears different from what a similar regex does. The regex /{.+}/ will match {abc}. It's my understanding that this is possible because the regex engine backtracks after consuming the closing } as part of the .+ and then failing to match.
So can Treetop do backtracking like that? If so, how?
I know you can use negation to match "anything other than a }." But that's not my intention. Suppose I want to be able to match the string {ab}c}. The tokens I want in that case are the opening {, a middle string of ab}c, and the closing }. This is a contrived example, but it becomes very relevant when working with nested expressions like {a b {c d}}.

Treetop is an implementation of a Parsing Expression Grammar parser. One of the benefits of PEGs is their combination of flexibility, speed, and memory requirements. However, this balancing act has some tradeoffs.
Quoting from the Wikipedia article:
The zero-or-more, one-or-more, and optional operators consume zero or more, one or more, or zero or one consecutive repetitions of their sub-expression e, respectively. Unlike in context-free grammars and regular expressions, however, these operators always behave greedily, consuming as much input as possible and never backtracking. […] the expression (a* a) will always fail because the first part (a*) will never leave any a's for the second part to match.
(Emphasis mine.)
In short: while certain PEG operators can backtrack in an attempt to take another route, the + operator cannot.
Instead, in order to match nested sub-expressions, you want to create an alternation between the delimited sub-expression (checked first) followed by the non-expression characters. Something like (untested):
grammar TestGrammar
rule open_close
'{' contents '}'
end
rule contents
open_close / non_brackets
end
rule non_brackets
# …
end
end

Related

Regix in ruby on rails for adding "\n\n" " 2 new line" when i found "\n" [duplicate]

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 11 years ago.
This question's answers are a community effort. Edit existing answers to improve this post. It is not currently accepting new answers or interactions.
I don't really understand regular expressions. Can you explain them to me in an easy-to-follow manner? If there are any online tools or books, could you also link to them?
The most important part is the concepts. Once you understand how the building blocks work, differences in syntax amount to little more than mild dialects. A layer on top of your regular expression engine's syntax is the syntax of the programming language you're using. Languages such as Perl remove most of this complication, but you'll have to keep in mind other considerations if you're using regular expressions in a C program.
If you think of regular expressions as building blocks that you can mix and match as you please, it helps you learn how to write and debug your own patterns but also how to understand patterns written by others.
Start simple
Conceptually, the simplest regular expressions are literal characters. The pattern N matches the character 'N'.
Regular expressions next to each other match sequences. For example, the pattern Nick matches the sequence 'N' followed by 'i' followed by 'c' followed by 'k'.
If you've ever used grep on Unix—even if only to search for ordinary looking strings—you've already been using regular expressions! (The re in grep refers to regular expressions.)
Order from the menu
Adding just a little complexity, you can match either 'Nick' or 'nick' with the pattern [Nn]ick. The part in square brackets is a character class, which means it matches exactly one of the enclosed characters. You can also use ranges in character classes, so [a-c] matches either 'a' or 'b' or 'c'.
The pattern . is special: rather than matching a literal dot only, it matches any character†. It's the same conceptually as the really big character class [-.?+%$A-Za-z0-9...].
Think of character classes as menus: pick just one.
Helpful shortcuts
Using . can save you lots of typing, and there are other shortcuts for common patterns. Say you want to match a digit: one way to write that is [0-9]. Digits are a frequent match target, so you could instead use the shortcut \d. Others are \s (whitespace) and \w (word characters: alphanumerics or underscore).
The uppercased variants are their complements, so \S matches any non-whitespace character, for example.
Once is not enough
From there, you can repeat parts of your pattern with quantifiers. For example, the pattern ab?c matches 'abc' or 'ac' because the ? quantifier makes the subpattern it modifies optional. Other quantifiers are
* (zero or more times)
+ (one or more times)
{n} (exactly n times)
{n,} (at least n times)
{n,m} (at least n times but no more than m times)
Putting some of these blocks together, the pattern [Nn]*ick matches all of
ick
Nick
nick
Nnick
nNick
nnick
(and so on)
The first match demonstrates an important lesson: * always succeeds! Any pattern can match zero times.
A few other useful examples:
[0-9]+ (and its equivalent \d+) matches any non-negative integer
\d{4}-\d{2}-\d{2} matches dates formatted like 2019-01-01
Grouping
A quantifier modifies the pattern to its immediate left. You might expect 0abc+0 to match '0abc0', '0abcabc0', and so forth, but the pattern immediately to the left of the plus quantifier is c. This means 0abc+0 matches '0abc0', '0abcc0', '0abccc0', and so on.
To match one or more sequences of 'abc' with zeros on the ends, use 0(abc)+0. The parentheses denote a subpattern that can be quantified as a unit. It's also common for regular expression engines to save or "capture" the portion of the input text that matches a parenthesized group. Extracting bits this way is much more flexible and less error-prone than counting indices and substr.
Alternation
Earlier, we saw one way to match either 'Nick' or 'nick'. Another is with alternation as in Nick|nick. Remember that alternation includes everything to its left and everything to its right. Use grouping parentheses to limit the scope of |, e.g., (Nick|nick).
For another example, you could equivalently write [a-c] as a|b|c, but this is likely to be suboptimal because many implementations assume alternatives will have lengths greater than 1.
Escaping
Although some characters match themselves, others have special meanings. The pattern \d+ doesn't match backslash followed by lowercase D followed by a plus sign: to get that, we'd use \\d\+. A backslash removes the special meaning from the following character.
Greediness
Regular expression quantifiers are greedy. This means they match as much text as they possibly can while allowing the entire pattern to match successfully.
For example, say the input is
"Hello," she said, "How are you?"
You might expect ".+" to match only 'Hello,' and will then be surprised when you see that it matched from 'Hello' all the way through 'you?'.
To switch from greedy to what you might think of as cautious, add an extra ? to the quantifier. Now you understand how \((.+?)\), the example from your question works. It matches the sequence of a literal left-parenthesis, followed by one or more characters, and terminated by a right-parenthesis.
If your input is '(123) (456)', then the first capture will be '123'. Non-greedy quantifiers want to allow the rest of the pattern to start matching as soon as possible.
(As to your confusion, I don't know of any regular-expression dialect where ((.+?)) would do the same thing. I suspect something got lost in transmission somewhere along the way.)
Anchors
Use the special pattern ^ to match only at the beginning of your input and $ to match only at the end. Making "bookends" with your patterns where you say, "I know what's at the front and back, but give me everything between" is a useful technique.
Say you want to match comments of the form
-- This is a comment --
you'd write ^--\s+(.+)\s+--$.
Build your own
Regular expressions are recursive, so now that you understand these basic rules, you can combine them however you like.
Tools for writing and debugging regexes:
RegExr (for JavaScript)
Perl: YAPE: Regex Explain
Regex Coach (engine backed by CL-PPCRE)
RegexPal (for JavaScript)
Regular Expressions Online Tester
Regex Buddy
Regex 101 (for PCRE, JavaScript, Python, Golang, Java 8)
I Hate Regex
Visual RegExp
Expresso (for .NET)
Rubular (for Ruby)
Regular Expression Library (Predefined Regexes for common scenarios)
Txt2RE
Regex Tester (for JavaScript)
Regex Storm (for .NET)
Debuggex (visual regex tester and helper)
Books
Mastering Regular Expressions, the 2nd Edition, and the 3rd edition.
Regular Expressions Cheat Sheet
Regex Cookbook
Teach Yourself Regular Expressions
Free resources
RegexOne - Learn with simple, interactive exercises.
Regular Expressions - Everything you should know (PDF Series)
Regex Syntax Summary
How Regexes Work
JavaScript Regular Expressions
Footnote
†: The statement above that . matches any character is a simplification for pedagogical purposes that is not strictly true. Dot matches any character except newline, "\n", but in practice you rarely expect a pattern such as .+ to cross a newline boundary. Perl regexes have a /s switch and Java Pattern.DOTALL, for example, to make . match any character at all. For languages that don't have such a feature, you can use something like [\s\S] to match "any whitespace or any non-whitespace", in other words anything.

Is there a method to find the most specific pattern for a string?

I'm wondering whether there is a way to generate the most specific regular expression (if such a thing exists) that matches a given string. Here's an illustration of what I want the method to do:
str = "(17 + 31)"
find_pattern(str)
# => /^\(\d+ \+ \d+\)$/ (or something more specific)
My intuition was to use Regex.new to accumulate the desired pattern by looping through str and checking for known patterns like \d, \s, and so on. I suspect there is an easy way for doing this.
This is in essence an algorithm compression problem. The simplest way to match a list of known strings is to use Regexp.union factory method, but that just tries each string in turn, it does not do anything "clever":
combined_rx = Regexp.union( "(17 + 31)", "(17 + 45)" )
=> /\(17\ \+\ 31\)|\(17\ \+\ 45\)/
This can still be useful to construct multi-stage validators, without you needing to write loops to check them all.
However, a generic pattern matcher that could figure out what you mean to match from examples is not really possible. There are too many ways in which you could consider strings to be similar or not. The closest I could think of would be genetic programming where you supply a large list of should match/should not match strings and the code guesses at the best regex by constructing random Regexp objects (a challenge in itself) and seeing how accurately they match and don't match your examples. The best matchers could be combined and mutated and tried again until you got 100% accuracy. This might be a fun project, but ultimately much more effort for most purposes than writing the regular expressions yourself from a description of the problem.
If your problem is heavily constrained - e.g. any example integer could always be replaced by \d+, any example space by \s+ etc, then you could work through the string replacing "matchable units", in fact using the same regular expressions checked in turn. E.g. if you match \A\d+ then consume the match from the string, and add \d+ to your regex. Then take the remainder of the string and look for next matching pattern. Working this way will have its limitations (you must know the full set of patterns you want to match in advance, and all examples would have to be unambiguous). However, it is more tractable than a genetic program.

Inconsistency between engines when using reluctant quantifier in negative look ahead

I found something odd when using a reluctant quantifier in a negative look ahead.
When creating a regex to assert a maximum of 3 uppercase characters, I devised this:
^(?!(.*?[A-Z]){4}).*$
which works on rubular, but not on regex101.
Why is that?
^, $ matches beginning/end of line in Ruby.
While in another languages, ^, $ matches the beginning/end of the string unless multiline mode (m) is specified. (Some regular expression engine requires g flag to match multiple times.)

Confusion with Atomic Grouping - how it differs from the Grouping in regular expression of Ruby?

I have gone through the docs for Atomic Grouping and rubyinfo and some questions came into my mind:
Why the name "Atomic grouping"? What "atomicity" does it have that general grouping doesn't?
How does atomic grouping differ to general grouping?
Why are atomic groups called non-capturing groups?
I tried the below code to understand but had confusion about the output and how differently they work on the same string as well?
irb(main):001:0> /a(?>bc|b)c/ =~ "abbcdabcc"
=> 5
irb(main):004:0> $~
=> #<MatchData "abcc">
irb(main):005:0> /a(bc|b)c/ =~ "abcdabcc"
=> 0
irb(main):006:0> $~
=> #<MatchData "abc" 1:"b">
A () has some properties (include those such as (?!pattern), (?=pattern), etc. and the plain (pattern)), but the common property between all of them is grouping, which makes the arbitrary pattern a single unit (unit is my own terminology), which is useful in repetition.
The normal capturing (pattern) has the property of capturing and group. Capturing means that the text matches the pattern inside will be captured so that you can use it with back-reference, in matching or replacement. The non-capturing group (?:pattern) doesn't have the capturing property, so it will save a bit of space and speed up a bit compared to (pattern) since it doesn't store the start and end index of the string matching the pattern inside.
Atomic grouping (?>pattern) also has the non-capturing property, so the position of the text matched inside will not be captured.
Atomic grouping adds property of atomic compared to capturing or non-capturing group. Atomic here means: at the current position, find the first sequence (first is defined by how the engine matches according to the pattern given) that matches the pattern inside atomic grouping and hold on to it (so backtracking is disallowed).
A group without atomicity will allow backtracking - it will still find the first sequence, then if the matching ahead fails, it will backtrack and find the next sequence, until a match for the entire regex expression is found or all possibilities are exhausted.
Example
Input string: bbabbbabbbbc
Pattern: /(?>.*)c/
The first match by .* is bbabbbabbbbc due to the greedy quantifier *. It will hold on to this match, disallowing c from matching. The matcher will retry at the next position to the end of the string, and the same thing happens. So nothing matches the regex at all.
Input string: bbabbbabbbbc
Pattern: /((?>.*)|b*)[ac]/, for testing /(((?>.*))|(b*))[ac]/
There are 3 matches to this regex, which are bba, bbba, bbbbc. If you use the 2nd regex, which is the same but with capturing groups added for debugging purpose, you can see that all the matches are result of matching b* inside.
You can see the backtracking behavior here.
Without the atomic grouping /(.*|b*)[ac]/, the string will have a single match which is the whole string, due to backtracking at the end to match [ac]. Note that the engine will go back to .* to backtrack by 1 character since it still have other possibilities.
Pattern: /(.*|b*)[ac]/
bbabbbabbbbc
^ -- Start matching. Look at first item in alternation: .*
bbabbbabbbbc
^ -- First match of .*, due to greedy quantifier
bbabbbabbbbc
X -- [ac] cannot match
-- Backtrack to ()
bbabbbabbbbc
^ -- Continue explore other possibility with .*
-- Step back 1 character
bbabbbabbbbc
^ -- [ac] matches, end of regex, a match is found
With the atomic grouping, all possibilities of .* is cut off and limited to the first match. So after greedily eating the whole string and fail to match, the engine have to go for the b* pattern, where it successfully finds a match to the regex.
Pattern: /((?>.*)|b*)[ac]/
bbabbbabbbbc
^ -- Start matching. Look at first item in alternation: (?>.*)
bbabbbabbbbc
^ -- First match of .*, due to greedy quantifier
-- The atomic grouping will disallow .* to be backtracked and rematched
bbabbbabbbbc
X -- [ac] cannot match
-- Backtrack to ()
-- (?>.*) is atomic, check the next possibility by alternation: b*
bbabbbabbbbc
^ -- Starting to rematch with b*
bbabbbabbbbc
^ -- First match with b*, due to greedy quantifier
bbabbbabbbbc
^ -- [ac] matches, end of regex, a match is found
The subsequent matches will continue on from here.
I recently had to explain Atomic Groups to someone else and I thought I'd tweak and share the example here.
Consider /the (big|small|biggest) (cat|dog|bird)/
Matches in bold
the big dog
the small bird
the biggest dog
the small cat
DEMO
For the first line, a regex engine would find the .
It would then proceed on to our adjectives (big, small, biggest), it finds big.
Having matched big, it proceeds and finds the space.
It then looks at our pets (cat, dog, bird), finds cat, skips it, and finds dog.
For the second line, our regex would find the .
It would proceed and look at big, skip it, look at and find small.
It finds the space, skips cat and dog because they don't match, and finds bird.
For the third line, our regex would find the ,
It continues on and finds big which matches the immediate requirement, and proceeds.
It can't find the space, so it backtracks (rewinds the position to the last choice it made).
It skips big, skips small, and finds biggest which also matches the immediate requirement.
It then finds the space.
It skips cat , and matches dog.
For the fourth line, our regex would find the .
It would proceed to look at big, skip it, look at and find small.
It then finds the space.
It looks at and matches cat.
Consider /the (?>big|small|biggest) (cat|dog|bird)/
Note the ?> atomic group on adjectives.
Matches in bold
the big dog
the small bird
the biggest dog
the small cat
DEMO
For the first line, second line, and fourth line, we'll get the same result.
For the third line, our regex would find the ,
It continues on and find big which matches the immediate requirement, and proceeds.
It can't find the space, but the atomic group, being the last choice the engine made, won't allow that choice to be re-examined (prohibits backtracking).
Since it can't make a new choice, the match has to fail, since our simple expression has no other choices.
This is only a basic summary. An engine wouldn't need to look at the entirety of cat to know that it doesn't match dog, merely looking at the c is enough. When trying to match bird, the c in cat and the d in dog are enough to tell the engine to examine other options.
However if you had ...((cat|snake)|dog|bird), the engine would also, of course, need to examine snake before it dropped to the previous group and examined dog and bird.
There are also plenty of choices an engine can't decide without going past what may not seem like a match, which is what results in backtracking. If you have ((red)?cat|dog|bird), The engine will look at r, back out, notice the ? quantifier, ignore the subgroup (red), and look for a match.
An "atomic group" is one where the regular expression will never backtrack past. So in your first example /a(?>bc|b)c/ if the bc alternation in the group matches, then it will never backtrack out of that and try the b alternation. If you slightly alter your first example to match against "abcdabcc" then you'll see it still matches the "abcc" at the end of the string instead of the "abc" at the start. If you don't use an atomic group, then it can backtrack past the bc and try the b alternation and end up matching the "abc" at the start.
As for question two, how it's different, that's just a rephrasing of your first question.
And lastly, atomic groups are not "called" non-capturing groups. That's not an alternate name for them. Non-capturing groups are groups that do not capture their content. Typically when you match a regular expression against a string, you can retrieve all the matched groups, and if you use a substitution, you can use backreferences in the substitution like \1 to insert the captured groups there. But a non-capturing group does not provide this. The classic non-capturing group is (?:pattern). An atomic group happens to also have the non-capturing property, hence why it's called a non-capturing group.

Automata with kleene star

Im learning about automata. Can you please help me understand how automata with Kleene closure works? Let's say I have letters a,b,c and I need to find text that ends with Kleene star - like ab*bac - how will it work?
The question seems to be more about how an automaton would handle Kleene closure than what Kleene closure means.
With a simple regular expression, e.g., abc, it's pretty straightforward to design an automaton to recognize it. Each state essentially tells you where you are in the expression so far. State 0 means it's seen nothing yet. State 1 means it's seen a. State 2 means it's seen ab. Etc.
The difficulty with Kleene closure is that a pattern like ab*bc introduces ambiguity. Once the automaton has seen the a and is then faced with a b, it doesn't know whether that b is part of the b* or the literal b that follows it, and it won't know until it reads more symbols--maybe many more.
The simplistic answer is that the automaton simply has a state that literally means it doesn't know yet which path was taken.
In simple cases, you can build this automaton directly. In general cases, you usually build something called a non-deterministic finite automaton. You can either simulate the NDFA, or--if performance is critical--you can apply an algorithm that converts the NDFA to a deterministic one. The algorithm essentially generates all the ambiguous states for you.
The Kleene star('*') means you can have as many occurrences of the character as you want (0 or more).
a* will match any number of a's.
(ab)* will match any number of the string "ab"
If you are trying to match an actual asterisk in an expression, the way you would write it depends entirely on the syntax of the regex you are working with. For the general case, the backwards slash \ is used as an escape character:
\* will match an asterisk.
For recognizing a pattern at the end, use concatenation:
(a U b)*c* will match any string that contains 0 or more 'c's at the end, preceded by any number of a's or b's.
For matching text that ends with a Kleene star, again, you can have 0 or more occurrences of the string:
ab(c)* - Possible matches: ab, abc abcc, abccc, etc.
a(bc)* - Possible matches: a, abc, abcbc, abcbcbc, etc.
Your expression ab*bac in English would read something like:
a followed by 0 or more b followed by bac
strings that would evaluate as a match to the regular expression if used for search
abac
abbbbbbbbbbac
abbac
strings that would not match
abaca //added extra literal
bac //missing leading a
As stated in the previous answer actually searching for a * would require an escape character which is implementation specific and would require knowledge of your language/library of choice.

Resources