Precedence of Ruby regular expressions? - ruby

I am reviewing regular expressions and cannot understand why a regular expression won't match a given string, specifically:
regex = /(ab*)+(bc)?/
mystring = "abbc"
The match matches "abb" but leaves the c off. I tested this using Rubular and in IRB and don't understand why the regex doesn't match the entire string. I thought that (ab*)+ would match "ab" and then (bc)? would match "bc".
Am I missing something in terms of precedence for regular expression operations?

Quantifiers are greedy by default: each part of the regular expression matches as much as it can, and the engine only backtracks when the rest of the pattern would otherwise fail. Here b* takes both b characters, so (ab*) matches "abb" (the + after the group then has nothing left to repeat). Since (bc) is optional, it can match the empty string, the pattern as a whole succeeds with "abb", and the engine has no reason to backtrack and try other alternatives.
If you want the whole string to be matched (which will force some backtracking in this case) make sure you anchor both ends of the string:
regex = /^(ab*)+(bc)?$/
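A quick way to see the difference in IRB is sketched below. It uses only the regex and string from the question; the \A...\z variant is an addition, since in Ruby ^ and $ anchor at line boundaries rather than at the ends of the whole string.

```ruby
mystring = "abbc"

# Unanchored: b* greedily takes both b's, (bc)? then matches nothing,
# and the engine stops because the overall pattern has already succeeded.
unanchored = mystring[/(ab*)+(bc)?/]

# Anchored: the end anchor fails after "abb", so the engine backtracks,
# b* gives one b back, and (bc)? can then consume "bc".
anchored = /\A(ab*)+(bc)?\z/.match(mystring)

puts unanchored   # abb
puts anchored[0]  # abbc
puts anchored[1]  # ab
puts anchored[2]  # bc
```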

The regex has two parenthesized groups, which are matched one after the other.
The first one matches abb, because (ab*) means an a followed by zero or more b. You have two b, so the match is abb. After that only c is left in your string, so the second group, which needs bc, has nothing to match. Because that group is optional, the match simply ends at abb.

Related

Ensure non-matching of a pattern within a scope

I am trying to create a regex that matches a pattern in some part of a string, but not in another part of the string.
I am trying to match a substring that
(i) is surrounded by a balanced pair of one or more consecutive backticks `
(ii) and does not include as many consecutive backticks as in the surrounding patterns
(iii) where the surrounding patterns (sequence of backticks) are not adjacent to other backticks.
This is some variant of the syntax of inline code notation in Markdown syntax.
Examples of matches are as follows:
"xxx`foo`yyy" # => matches "foo"
"xxx``foo`bar`baz``yyy" # => matches "foo`bar`baz"
"xxx```foo``bar``baz```yyy" # => matches "foo``bar``baz"
One regex to achieve this is:
/(?<!`)(?<backticks>`+)(?<inline>.+?)\k<backticks>(?!`)/
which uses a non-greedy match.
I was wondering if I can get rid of the non-greedy match.
The idea comes from when the prohibited pattern is a single character. When I want to match a substring that is surrounded by a single quote ' that does not include a single quote in it, I can do either:
/'.+?'/
/'[^']+'/
The first one uses non-greedy match, and the second one uses an explicit non-matching pattern [^'].
I am wondering if it is possible to have something like the second form when the prohibited pattern is not a single character.
Going back to the original issue, there is negative lookahead syntax (?!...), but I cannot restrict its effective scope. If I make my regex like this:
/(?<!`)(?<backticks>`+)(?<inline>(?!.*\k<backticks>).*)\k<backticks>(?!`)/
then the effect of (?!.*\k<backticks>) will not be limited to within (?<inline>...), but will extend to the rest of the string. And since that contradicts the \k<backticks> at the end, the regex fails to match.
Is there a regex technique to ensure non-matching of a pattern (not-necessarily a single character) within a certain scope?
You can match one or more characters, checking before each one that the delimiter does not start at that position:
/(?<!`)(?<backticks>`+)(?<inline>(?:(?!\k<backticks>).)+)\k<backticks>(?!`)/
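As a rough check, the regex above can be run against the three examples from the question. The constant name INLINE_CODE and the samples array are only illustrative.

```ruby
# (?!\k<backticks>). consumes one character only when the opening
# delimiter does not start at that position.
INLINE_CODE = /(?<!`)(?<backticks>`+)(?<inline>(?:(?!\k<backticks>).)+)\k<backticks>(?!`)/

samples = [
  "xxx`foo`yyy",
  "xxx``foo`bar`baz``yyy",
  "xxx```foo``bar``baz```yyy"
]

# String#[] with a capture name returns just that named group.
results = samples.map { |s| s[INLINE_CODE, "inline"] }

results.each { |r| puts r }
# foo
# foo`bar`baz
# foo``bar``baz
```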

Why won't a longer token in an alternation be matched?

I am using Ruby 2.1, but the same thing can be reproduced on Rubular.
If this is my string:
儘管中國婦幼衛生監測辦公室制定的
And I do a regex match with this expression:
(中國婦幼衛生監測辦公室制定|管中)
I am expecting to get the longer token as a match.
中國婦幼衛生監測辦公室制定
Instead I get the second alternation as a match.
As far as I know, it does work like that when the text is not in Chinese characters.
If this is my string:
foobar
And I use this regex:
(foobar|foo)
The returned match is foobar. If the order is the other way around, then the match is foo. That makes sense to me.
Your assumption that regex matches a longer alternation is incorrect.
If you have a bit of time, let's look at how your regex works...
Quick refresher: How regex works: The state machine always reads from left to right, backtracking where necessary.
There are two pointers, one on the Pattern:
(cdefghijkl|bcd)
The other on your String:
abcdefghijklmnopqrstuvw
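The pointer on the String moves from the left. At each position the alternations are tried in order, and as soon as the engine can return a match, it will.
Put more sequentially: at a, neither cdefghijkl nor bcd matches, so the String pointer advances. At b, cdefghijkl fails but bcd matches, so bcd is returned. The longer alternation could only match starting one character later, at c, and that position is never tried.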
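The same behaviour can be checked directly in Ruby. This is a minimal sketch using the pattern and strings already shown in this thread; the variable names are illustrative.

```ruby
pattern = /(cdefghijkl|bcd)/
string  = "abcdefghijklmnopqrstuvw"

m = pattern.match(string)
puts m[0]        # bcd
puts m.begin(0)  # 1, the match starts at "b", before the longer token could begin

# The string from the question behaves the same way: 管中 starts one
# character earlier than the longer token, so it wins.
chinese = "儘管中國婦幼衛生監測辦公室制定的"[/(中國婦幼衛生監測辦公室制定|管中)/]
puts chinese     # 管中
```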
Your foobar example is a different topic. As I mentioned in this post:
How regex works: The state machine always reads from left to right. ,|,, behaves the same as , because only the first alternation will ever be matched.
    That's good, Unihedron, but how do I force it to the first alternation?
Look!*
^(?:.*?\Kcdefghijkl|.*?\Kbcd)
This regex first attempts to match the first alternation anywhere in the string. Only if that fails completely will it then attempt the second alternation. \K resets the start of the reported match, so only the text after \K is kept in the result.
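A small sketch of that regex in Ruby (2.0 or later, for \K) follows. The second input, "abcdxyz", is a made-up string that does not contain the longer token, used to show the fallback to the second alternation.

```ruby
prefer_first = /^(?:.*?\Kcdefghijkl|.*?\Kbcd)/

# The lazy .*? scans forward until the first alternation fits;
# \K drops the scanned prefix from the reported match.
preferred = "abcdefghijklmnopqrstuvw"[prefer_first]

# No "cdefghijkl" here, so the first alternation fails everywhere
# and the second one is used instead.
fallback = "abcdxyz"[prefer_first]

puts preferred  # cdefghijkl
puts fallback   # bcd
```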
*: \K has been supported in Ruby since 2.0.0.
Read more:
The Stack Overflow Regex Reference
On greedy vs non-greedy
Ah, I was bored, so I optimized the regex:
^(?:(?:(?!cdefghijkl)c?[^c]*)++\Kcdefghijkl|(?:(?!bcd)b?[^b]*)++\Kbcd)

Ruby string does not match expression

I have this ruby expression as below
(a|bc)(d?|e)*
When I use Rubular to test possible strings that fit this expression, there are some strings where I don't understand why they don't fit.
One such string is "ade": it matches "ad" but does not match the "e". Can anyone help?
The second part of the regular expression you entered, (d?|e)*, is the problem. Putting the ? on the d says: match d 0 or 1 times. When you run through the string ade, the regex matches a, then d, and then d 0 times: at the e, d? succeeds by matching the empty string, so the e alternative is never tried and the repetition stops. If you instead changed it to (a|bc)(d|e)*, it would match ade, and seem to have the semantics that you're looking for.
d? can match the empty string, so it always succeeds, and the e branch is "short circuited" by the alternation (logical or): the first alternative is taken and e is never tried.
I don't know why you put a question mark there. Just use
(a|bc)(d|e)*
and it will be fine.
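To compare the two patterns side by side, something like the following can be pasted into IRB. The variable names and the extra input "bcdeed" are illustrative.

```ruby
original = /(a|bc)(d?|e)*/
fixed    = /(a|bc)(d|e)*/

# With d?, the group matches the empty string at "e" and the loop ends there.
with_optional = "ade"[original]

# Without the ?, each iteration must consume a d or an e.
without_optional = "ade"[fixed]
longer = "bcdeed"[fixed]

puts with_optional     # ad
puts without_optional  # ade
puts longer            # bcdeed
```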

How can I write a regex in Ruby that will determine if a string meets this criteria?

How can I write a regex in Ruby 1.9.2 that will determine if a string meets this criteria:
Can only include letters, numbers and the - character
Cannot be an empty string, i.e. cannot have a length of 0
Must contain at least one letter
/\A[a-z0-9-]*[a-z][a-z0-9-]*\z/i
It goes like
beginning of string
some (or zero) letters, digits and/or dashes
a letter
some (or zero) letters, digits and/or dashes
end of string
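A minimal sketch of that regex applied to a few inputs is shown below. The helper name valid_name? and the sample strings are made up for illustration; =~ is used rather than match? so the snippet also runs on older Rubies such as 1.9.2.

```ruby
VALID = /\A[a-z0-9-]*[a-z][a-z0-9-]*\z/i

# Hypothetical helper: true when the string satisfies all three rules.
def valid_name?(str)
  !!(str =~ VALID)
end

puts valid_name?("abc-123")  # true
puts valid_name?("1-a")      # true
puts valid_name?("123")      # false (no letter)
puts valid_name?("")         # false (empty)
puts valid_name?("ab_c")     # false (underscore not allowed)
```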
I suppose these two will help you: /\A[a-z0-9\-]{1,}\z/i and /[a-z]{1,}/i. The first one checks on first two rules and the second one checks for the last condition.
No regex:
str.count("a-zA-Z") > 0 && str.count("^a-zA-Z0-9-") == 0
You can take a look at this tutorial for how to use regular expressions in Ruby. With regard to what you need, you can use the following (note that it covers the first two rules only; it does not check for at least one letter):
^[A-Za-z0-9\-]+$
The ^ will instruct the regex engine to start matching from the very beginning of the string.
The [..] will instruct the regex engine to match any one of the characters they contain.
A-Z means any upper case letter, a-z means any lower case letter and 0-9 means any digit.
The \- will instruct the regex engine to match the -. The \ is used in front of it because inside square brackets the - is a special symbol (it denotes a range), so it needs to be escaped.
The $ will instruct the regex engine to stop matching at the end of the line. (In Ruby, ^ and $ anchor at line boundaries; \A and \z anchor at the start and end of the whole string.)
The + instructs the regex engine to match what is contained between the square brackets one or more time.
You can also use the i flag to make your search case insensitive, so the regex might become something like this:
/^[a-z0-9\-]+$/i
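One caveat when validating a whole string in Ruby: ^ and $ match at the start and end of each line, so multi-line input can slip through. The sketch below contrasts them with \A and \z on a made-up two-line input.

```ruby
input = "abc\n!!!"  # first line is valid, second line is not

line_anchored   = /^[a-z0-9\-]+$/i
string_anchored = /\A[a-z0-9\-]+\z/i

# ^...$ is satisfied by the first line alone.
line_result = !!(input =~ line_anchored)

# \A...\z must span the entire string, newline included.
string_result = !!(input =~ string_anchored)

puts line_result    # true
puts string_result  # false
```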

Ruby regex, is there a way to only match literal matches?

I'm trying to parse using a case/when statement with regex in it. I'm having some trouble with the match as it will give me a match even if it's not a literal match.
Example:
if I input ($45, x), I get back: "address mode: indirect, x -> value: 45" from this regex:
/[(][$][1-9a-fA-F]{1,2}\s*,\s*[xX]\s*[)]/
Now, if I input ($45, p), I get a match for this regex:
/[$][1-9a-fA-F]{2,4}/
Which is understandable, but I would like the match to look only for literal matches. If there are extra characters that do not exactly match the regex, I want the match function to return false.
Are there other functions like match(), or extra arguments that can be given to match(), to get this behavior?
From your question, it is a little unclear what you are after. Your second regex is matching on the substring
$45
If you want to avoid this, use the anchors ^ and $ to ensure the entire string is matched (in Ruby these anchor each line; \A and \z anchor the whole string). Something like:
^\(\$[1-9A-Za-z]+,\s*[xX]\s*\)$
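Applied to the case/when from the question, the anchoring could look like the sketch below. It reuses the two regexes from the question with \A and \z added. The method name address_mode and the symbols it returns are hypothetical, since the original parsing code is not shown.

```ruby
# Hypothetical helper: returns a symbol for the recognized form, or nil.
def address_mode(input)
  case input
  when /\A[(][$][1-9a-fA-F]{1,2}\s*,\s*[xX]\s*[)]\z/
    :indirect_x
  when /\A[$][1-9a-fA-F]{2,4}\z/
    :plain_address  # illustrative label for the second pattern
  end
end

p address_mode("($45, x)")  # :indirect_x
p address_mode("($45, p)")  # nil, no pattern matches the whole input
p address_mode("$45")       # :plain_address
```

Because both patterns must now span the whole input, "($45, p)" no longer falls through to the second regex via its "$45" substring.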

Resources