How does backtracking differ from back-referencing in regular expressions? - ruby

How does backtracking differ from back-referencing in regular expressions?
How does back-referencing win limitation having with backtracking or vice-versa?

Backtracking is a way for a state machine to back up and retry other matches for a regular expression. It's something that's pretty much internal to the regex engine.
For example, say you're trying to match the regex [a-z]*a, any number of lower case characters followed by an a.
Given the input abca, a greedy match will assign all of that to the [a-z] portion of the regex but then there's no way to match the final a. Backtracking allows the engine to back up by returning that final a to the input stream and trying again, assigning abc to the [a-z] portion and a to the a portion.
Back-referencing on the other hand, is a means for a user of the regex engine to reference previously captured groups. For example,
s/^([a-z])([a-z])/\1_\2/
\_____/\_____/
| |
| +- capture group 2
+-------- capture group 1
may be a command to insert _ between two consecutive lower case letters at the start of each line. The \N back-reference (where N represents a number) refers back to the groups captured within ().

Related

Need help understanding why this string in grep pulls IP addresses rather than this other string

The following statement is from a homework question which I tested out and answered, but I'm just not understanding how come this line behaves the way it does and I want to understand why. I realize why this expression is flawed to find an IP address but I don't fully understand why it behaves the way it does since it seems as if the question mark doesn't actually behave as 0 or 1 times in like it's supposed to.
"user#machine:~$ grep -E '[01]?[0-9][0-9]?' "
To my understanding "[01]?" should look for any number 0-1 as indicated by the brackets while the question mark tells grep to look for zero or one instance only and similar with "[0-9]?". Thing is this line will print an unlimited number of digits far exceeding 3 digits. I ruled out that it was due to the 3rd bracket that didn't have a proceeding question mark since it would still print an unlimited amount of digits if I piped an echo or used a testing .txt file full of numbers.
This above example made me than wonder how to find IP's with grep the correct way. So I found countless examples like the following expression for IPv4 octets:
\.(25[0-5]\|2[0-4][0-9]\|[01][0-9][0-9]\|[0-9][0-9]).\
Is this telling me to look for any number 2-5 anywhere from 0-5 times? 0-5 is too many digits for an octet. Is it telling me to look for any number 0-5 up to 25 times? Again that's way too many digits for an octet. What does \2[0-4][0-9]\ mean in this case? I'm confused about how this expression finds numbers strictly between 1-255?
Look at it this way: x?[0-9]x? matches anything which contains a digit because both the x:es are optional. You might as well leave them out because they do not constrain the match at all.
25[0-5] looks for 25 followed by a digit in the range 0-5. In other words, the expression matches a number in the range 250-255.
The full expression in your example looks for a number in the range 00-255 by enumerating strings beginning with 25, 20-24, etc; though it's incomplete in that it doesn't permit single-digit numbers.
The expression matches a single octet (incompletely), not an entire IP address. Here is a common way to match an IPv4 address:
([3-9][0-9]?|2([0-4][0-9]?|5[0-9]?|[6-9])?|1([0-9][0-9]?)?)(\.([3-9][0-9]?|2([0-4][0-9]?|5[0-9]?|[6-9])?|1([0-9][0-9]?)?){3}
where the square brackets express character classes which match a single character out of a set, and the final curly braces {3} express a repetition.
Some regex dialects (e.g. POSIX grep) require backslashes before | and \( but I have used the extended notation (a la grep -E and most online regex exploration tools) which doesn't want backslashes.

How to determine the number of grouped numbers in a string in bash

I have a string in bash
string="123 abc 456"
Where numbers that are grouped together are considered 1 number.
"123" and "456" would be considered numbers in this case.
How can i determine the number of grouped together numbers?
so
"123"
is determined to be a string with just one number, and
"123 abc 456"
is determined to be a string with 2 numbers.
egrep -o '[0-9]+' <<<"$string" | wc -l
Explanation
egrep: This performs an extended regular expression match on the lines of a given file (or, in this case, a herestring). It usually returns lines of text within the string that contain at least one chunk of text that matches the supplied pattern. However, the -o flag tells it to return only those matching chunks, one per line of output.
'[0-9]+': This is the regular expression that the string is compared against. Here, we are telling it to match successive runs of 1 or more digits, and no other character.
<<< The herestring operator allows us to pass a string into a command as if were the contents of a file.
| This pipes the output of the previous command (egrep) to become the input for the next command (wc).
wc: This performs a word count, normally returning the number of words in a given argument. However, the -l tells it to do a line count instead.
UPDATE: 2018-08-23
Is there any way to adapt your solution to work with floats?
The regular expression that matches both integer numbers and floating point decimal numbers would be something like this: '[0-9]*\.?[0-9]+'. Inserting this into the command above in place of its predecessor, forms this command chain:
egrep -o '[0-9]*\.?[0-9]+' <<<"$string" | wc -l
Focussing now only on the regular expression, here's how it works:
[0-9]: This matches any single digit from 0 to 9.
*: This is an operator that applies to the expression that comes directly before it, i.e. the [0-9] character class. It tells the search engine to match any number of occurrences of the digits 0 to 9 instead of just one, but no other character. Therefore, it will match "2", "26", "4839583", ... but it will not match "9.99" as a singular entity (but will, of course, match the "9" and the "99" that feature within it). As the * operator matches any number of successive digits, this can include zero occurrences (this will become relevant later).
\.: This matches a singular occurrence of a period (or decimal point), ".". The backslash is a special character that tells the search engine to interpret the period as a literal period, because this character itself has special function in regular expression strings, acting as a wildcard to match any character except a line-break. Without the backslash, that's what it would do, which would potentially match "28s" if it came across it, where the "s" was caught by the wildcard period. However, the backslash removes the wildcard functionality, so it will now only match with an actual period.
?: Another operator, like the *, except this one tells the search engine to match the previous expression either zero or one times, but no more. In other words, it makes the decimal point optional.
[0-9]+: As before, this will match digits 0 to 9, the number of which here is determined by the + operator, which standards for at least one, i.e. one or more digits.
Applying this to the following string:
"The value of pi is approximately 3.14159. The value of e is about 2.71828. The Golden Ratio is approximately 1.61803, which can be expressed as (√5 + 1)/2."
yields the following matches (one per line):
3.14159
2.71828
1.61803
5
1
2
And when this is piped through the wc -l command, returns a count of the lines, which is 6, i.e. the supplied string contains 6 occurrences of number strings, which includes integers and floating point decimals.
If you wanted only the floating point decimals, and to exclude the integers, the regular expression is this:
'[0-9]*\.[0-9]+'
If you look carefully, it's identical to the previous regular expression, except for the missing ? operator. If you recall, the ? made the decimal point an optional feature to match; removing this operator now means the decimal point must be present. Likewise, the + operator is matching at least one instance of a digit following the decimal point. However, the * operator before it matches any number of digits, including zero digits. Therefore, "0.61803" would be a valid match (if it were present in the string, which it isn't), and ".33333" would also be a valid match, since the digits before the decimal point needn't be there thanks to the * operator. However, whilst "1.1111" could be a valid match, "1111." would not be, because + operator dictates that there must be at least one digit following the decimal point.
Putting it into the command chain:
egrep -o '[0-9]*\.[0-9]+' <<<"$string" | wc -l
returns a value of 3, for the three floating point decimals occurring in the string, which, if you remove the | wc -l portion of the command, you will see in the terminal output as:
3.14159
2.71828
1.61803
For reasons I won't go into, matching integers exclusively and excluding floating point decimals is harder to accomplish with Perl-flavoured regular expression matching (which egrep is not). However, since you're really only interested in the number of these occurrences, rather than the matches themselves, we can create a regular expression that doesn't need to worry about accurate matching of integers, as long as it produces the same number of matched items. This expression:
'[^.0-9][0-9]+(\.([^0-9]|$)|[^.])'
seems to be good enough for counting the integers in the string, which includes the 5, 1 and 2 (ignoring, of course, the √ symbol), returning these approximately matches substrings:
√5
1)
/2.
I haven't tested it that thoroughly, however, and only formulated it tonight when I read your comment. But, hopefully, you are beginning to get a rough sense of what's going on.
In case you need to know the number of grouped digits in string then following may help you.
string="123 abc 456"
echo "$string" | awk '{print gsub(/[0-9]+/,"")}'
Explanation: Adding explanation too here, following is only for explanation purposes.
string="123 abc 456" ##Creating string named string with value of 123 abc 456.
echo "$string" ##Printing value of string here with echo.
| ##Putting its output as input to awk command.
awk '{ ##Initializing awk command here.
print gsub(/[0-9]+/,"") ##printing value of gsub here(where gsub is for substituting the values of all digits in group with ""(NULL)).
it will globally substitute the digits and give its count(how many substitutions happens will be equal to group of digits present).
}' ##Closing awk command here.

Can someone give me an example of regular expressions using {x} and {x,y}?

I just learned from a book about regular expressions in the Ruby language. I did Google it, but still got confused about {x} and {x,y}.
The book says:
{x}→Match x occurrences of the preceding character.
{x,y}→Match at least x occurrences and at most y occurrences.
Can anyone explain this better, or provide some examples?
Sure, look at these examples:
http://rubular.com/r/sARHv0vf72
http://rubular.com/r/730Zo6rIls
/a{4}/
is the short version for:
/aaaa/
It says: Match exact 4 (consecutive) characters of 'a'.
where
/a{2,4}/
says: Match at least 2, and at most 4 characters of 'a'.
it will match
/aa/
/aaa/
/aaaa/
and it won't match
/a/
/aaaaa/
/xxx/
Limiting Repetition good online tutorial for this.
I highly recommend regexbuddy.com and very briefly, the regex below does what you refer to:
[0-9]{3}|\w{3}
The [ ] characters indicate that you must match a number between 0 and 9. It can be anything, but the [ ] is literal match. The { } with a 3 inside means match sets of 3 numbers between 0 and 9. The | is an or statement. The \w, is short hand for any word character and once again the {3} returns only sets of 3.
If you go to RegexPal.com you can enter the code above and test it. I used the following data to test the expression:
909 steve kinzey
and the expression matched the 909, the 'ste', the 'kin' and the 'zey'. It did not match the 've' because it is only 2 word characters long and a word character does not span white space so it could not carry over to the second word.
Interval Expressions
GNU awk refers to these as "interval expressions" in the Regexp Operators section of its manual. It explains the expressions as follows:
{n}
{n,}
{n,m}
One or two numbers inside braces denote an interval expression. If there is one number in the braces, the preceding regexp is repeated n times. If there are two numbers separated by a comma, the preceding regexp is repeated n to m times. If there is one number followed by a comma, then the preceding regexp is repeated at least n times:
The manual also includes these reference examples:
wh{3}y
Matches ‘whhhy’, but not ‘why’ or ‘whhhhy’.
wh{3,5}y
Matches ‘whhhy’, ‘whhhhy’, or ‘whhhhhy’, only.
wh{2,}y
Matches ‘whhy’ or ‘whhhy’, and so on.
See Also
Ruby's Regexp class.
Quantifiers section of Ruby's oniguruma engine.

Confusion with Atomic Grouping - how it differs from the Grouping in regular expression of Ruby?

I have gone through the docs for Atomic Grouping and rubyinfo and some questions came into my mind:
Why the name "Atomic grouping"? What "atomicity" does it have that general grouping doesn't?
How does atomic grouping differ to general grouping?
Why are atomic groups called non-capturing groups?
I tried the below code to understand but had confusion about the output and how differently they work on the same string as well?
irb(main):001:0> /a(?>bc|b)c/ =~ "abbcdabcc"
=> 5
irb(main):004:0> $~
=> #<MatchData "abcc">
irb(main):005:0> /a(bc|b)c/ =~ "abcdabcc"
=> 0
irb(main):006:0> $~
=> #<MatchData "abc" 1:"b">
A () has some properties (include those such as (?!pattern), (?=pattern), etc. and the plain (pattern)), but the common property between all of them is grouping, which makes the arbitrary pattern a single unit (unit is my own terminology), which is useful in repetition.
The normal capturing (pattern) has the property of capturing and group. Capturing means that the text matches the pattern inside will be captured so that you can use it with back-reference, in matching or replacement. The non-capturing group (?:pattern) doesn't have the capturing property, so it will save a bit of space and speed up a bit compared to (pattern) since it doesn't store the start and end index of the string matching the pattern inside.
Atomic grouping (?>pattern) also has the non-capturing property, so the position of the text matched inside will not be captured.
Atomic grouping adds property of atomic compared to capturing or non-capturing group. Atomic here means: at the current position, find the first sequence (first is defined by how the engine matches according to the pattern given) that matches the pattern inside atomic grouping and hold on to it (so backtracking is disallowed).
A group without atomicity will allow backtracking - it will still find the first sequence, then if the matching ahead fails, it will backtrack and find the next sequence, until a match for the entire regex expression is found or all possibilities are exhausted.
Example
Input string: bbabbbabbbbc
Pattern: /(?>.*)c/
The first match by .* is bbabbbabbbbc due to the greedy quantifier *. It will hold on to this match, disallowing c from matching. The matcher will retry at the next position to the end of the string, and the same thing happens. So nothing matches the regex at all.
Input string: bbabbbabbbbc
Pattern: /((?>.*)|b*)[ac]/, for testing /(((?>.*))|(b*))[ac]/
There are 3 matches to this regex, which are bba, bbba, bbbbc. If you use the 2nd regex, which is the same but with capturing groups added for debugging purpose, you can see that all the matches are result of matching b* inside.
You can see the backtracking behavior here.
Without the atomic grouping /(.*|b*)[ac]/, the string will have a single match which is the whole string, due to backtracking at the end to match [ac]. Note that the engine will go back to .* to backtrack by 1 character since it still have other possibilities.
Pattern: /(.*|b*)[ac]/
bbabbbabbbbc
^ -- Start matching. Look at first item in alternation: .*
bbabbbabbbbc
^ -- First match of .*, due to greedy quantifier
bbabbbabbbbc
X -- [ac] cannot match
-- Backtrack to ()
bbabbbabbbbc
^ -- Continue explore other possibility with .*
-- Step back 1 character
bbabbbabbbbc
^ -- [ac] matches, end of regex, a match is found
With the atomic grouping, all possibilities of .* is cut off and limited to the first match. So after greedily eating the whole string and fail to match, the engine have to go for the b* pattern, where it successfully finds a match to the regex.
Pattern: /((?>.*)|b*)[ac]/
bbabbbabbbbc
^ -- Start matching. Look at first item in alternation: (?>.*)
bbabbbabbbbc
^ -- First match of .*, due to greedy quantifier
-- The atomic grouping will disallow .* to be backtracked and rematched
bbabbbabbbbc
X -- [ac] cannot match
-- Backtrack to ()
-- (?>.*) is atomic, check the next possibility by alternation: b*
bbabbbabbbbc
^ -- Starting to rematch with b*
bbabbbabbbbc
^ -- First match with b*, due to greedy quantifier
bbabbbabbbbc
^ -- [ac] matches, end of regex, a match is found
The subsequent matches will continue on from here.
I recently had to explain Atomic Groups to someone else and I thought I'd tweak and share the example here.
Consider /the (big|small|biggest) (cat|dog|bird)/
Matches in bold
the big dog
the small bird
the biggest dog
the small cat
DEMO
For the first line, a regex engine would find the .
It would then proceed on to our adjectives (big, small, biggest), it finds big.
Having matched big, it proceeds and finds the space.
It then looks at our pets (cat, dog, bird), finds cat, skips it, and finds dog.
For the second line, our regex would find the .
It would proceed and look at big, skip it, look at and find small.
It finds the space, skips cat and dog because they don't match, and finds bird.
For the third line, our regex would find the ,
It continues on and finds big which matches the immediate requirement, and proceeds.
It can't find the space, so it backtracks (rewinds the position to the last choice it made).
It skips big, skips small, and finds biggest which also matches the immediate requirement.
It then finds the space.
It skips cat , and matches dog.
For the fourth line, our regex would find the .
It would proceed to look at big, skip it, look at and find small.
It then finds the space.
It looks at and matches cat.
Consider /the (?>big|small|biggest) (cat|dog|bird)/
Note the ?> atomic group on adjectives.
Matches in bold
the big dog
the small bird
the biggest dog
the small cat
DEMO
For the first line, second line, and fourth line, we'll get the same result.
For the third line, our regex would find the ,
It continues on and find big which matches the immediate requirement, and proceeds.
It can't find the space, but the atomic group, being the last choice the engine made, won't allow that choice to be re-examined (prohibits backtracking).
Since it can't make a new choice, the match has to fail, since our simple expression has no other choices.
This is only a basic summary. An engine wouldn't need to look at the entirety of cat to know that it doesn't match dog, merely looking at the c is enough. When trying to match bird, the c in cat and the d in dog are enough to tell the engine to examine other options.
However if you had ...((cat|snake)|dog|bird), the engine would also, of course, need to examine snake before it dropped to the previous group and examined dog and bird.
There are also plenty of choices an engine can't decide without going past what may not seem like a match, which is what results in backtracking. If you have ((red)?cat|dog|bird), The engine will look at r, back out, notice the ? quantifier, ignore the subgroup (red), and look for a match.
An "atomic group" is one where the regular expression will never backtrack past. So in your first example /a(?>bc|b)c/ if the bc alternation in the group matches, then it will never backtrack out of that and try the b alternation. If you slightly alter your first example to match against "abcdabcc" then you'll see it still matches the "abcc" at the end of the string instead of the "abc" at the start. If you don't use an atomic group, then it can backtrack past the bc and try the b alternation and end up matching the "abc" at the start.
As for question two, how it's different, that's just a rephrasing of your first question.
And lastly, atomic groups are not "called" non-capturing groups. That's not an alternate name for them. Non-capturing groups are groups that do not capture their content. Typically when you match a regular expression against a string, you can retrieve all the matched groups, and if you use a substitution, you can use backreferences in the substitution like \1 to insert the captured groups there. But a non-capturing group does not provide this. The classic non-capturing group is (?:pattern). An atomic group happens to also have the non-capturing property, hence why it's called a non-capturing group.

Automata with kleene star

Im learning about automata. Can you please help me understand how automata with Kleene closure works? Let's say I have letters a,b,c and I need to find text that ends with Kleene star - like ab*bac - how will it work?
The question seems to be more about how an automaton would handle Kleene closure than what Kleene closure means.
With a simple regular expression, e.g., abc, it's pretty straightforward to design an automaton to recognize it. Each state essentially tells you where you are in the expression so far. State 0 means it's seen nothing yet. State 1 means it's seen a. State 2 means it's seen ab. Etc.
The difficulty with Kleene closure is that a pattern like ab*bc introduces ambiguity. Once the automaton has seen the a and is then faced with a b, it doesn't know whether that b is part of the b* or the literal b that follows it, and it won't know until it reads more symbols--maybe many more.
The simplistic answer is that the automaton simply has a state that literally means it doesn't know yet which path was taken.
In simple cases, you can build this automaton directly. In general cases, you usually build something called a non-deterministic finite automaton. You can either simulate the NDFA, or--if performance is critical--you can apply an algorithm that converts the NDFA to a deterministic one. The algorithm essentially generates all the ambiguous states for you.
The Kleene star('*') means you can have as many occurrences of the character as you want (0 or more).
a* will match any number of a's.
(ab)* will match any number of the string "ab"
If you are trying to match an actual asterisk in an expression, the way you would write it depends entirely on the syntax of the regex you are working with. For the general case, the backwards slash \ is used as an escape character:
\* will match an asterisk.
For recognizing a pattern at the end, use concatenation:
(a U b)*c* will match any string that contains 0 or more 'c's at the end, preceded by any number of a's or b's.
For matching text that ends with a Kleene star, again, you can have 0 or more occurrences of the string:
ab(c)* - Possible matches: ab, abc abcc, abccc, etc.
a(bc)* - Possible matches: a, abc, abcbc, abcbcbc, etc.
Your expression ab*bac in English would read something like:
a followed by 0 or more b followed by bac
strings that would evaluate as a match to the regular expression if used for search
abac
abbbbbbbbbbac
abbac
strings that would not match
abaca //added extra literal
bac //missing leading a
As stated in the previous answer actually searching for a * would require an escape character which is implementation specific and would require knowledge of your language/library of choice.

Resources