Lookbehind with the ^ character in a Ruby regex - ruby

Why, in Ruby, do the first two regexes fail to match while the third matches?
str = 'ID: 4'
regex1 = /^(?<=ID: )\d+/
regex2 = /\A(?<=ID: )\d+/
regex3 = /(?<=ID: )\d+/
str.match(regex1) # => nil
str.match(regex2) #=> nil
str.match(regex3) #=> #<MatchData "4">
The only difference is the ^ or \A characters, which match the beginning of a line and beginning of the string, respectively. It seems both should be matched by str.

The look-behind pattern (?<=ID: ) matches a position in the string that is preceded by «ID: ».
The anchors ^ and \A match a position at the beginning of the line or string.
So the pattern \A(?<=ID: ) asks that both match together, i.e. that the beginning of the string is preceded by «ID: ». Not gonna happen!

Both of these would work fine if you put the anchor inside of the lookbehind:
regex1 = /(?<=^ID: )\d+/
regex2 = /(?<=\AID: )\d+/
If the anchors are outside of the lookbehind then you are saying "from the start of the string, are the previous characters ID:". This will always fail because there won't be any characters before the start of the string.
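As a quick check of the point above, here is a minimal sketch that runs all four patterns against the question's string. The constant names are only illustrative and are not part of the original question.

```ruby
STR = 'ID: 4'

# Anchor outside the lookbehind: the start of the line/string
# would itself have to be preceded by "ID: ".
ANCHOR_OUTSIDE = [/^(?<=ID: )\d+/, /\A(?<=ID: )\d+/]

# Anchor inside the lookbehind: "ID: " must begin the line/string.
ANCHOR_INSIDE = [/(?<=^ID: )\d+/, /(?<=\AID: )\d+/]

# String#[] returns the matched text, or nil when nothing matches.
ANCHOR_OUTSIDE.each { |re| p STR[re] }  # nil, nil
ANCHOR_INSIDE.each  { |re| p STR[re] }  # "4", "4"

# The anchored lookbehind still rejects an "ID: " that is not at the start.
p 'My ID: 4'[/(?<=\AID: )\d+/]          # nil
```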

Look-ahead and look-behind are zero-width assertions: they test the text around the current position without consuming it. That is why the first two expressions don't match.
The first expression, for instance, amounts to /^\d+/ with the added condition that the digits be preceded by "ID: ". That condition can never hold, because the position matched by ^ is preceded either by nothing or by a newline.

In the third expression, the lookbehind can be tested at any position, and it succeeds at the zero-width position just before the 4. You can see that only the 4 is matched.
With ^ or \A, the zero-width position at the beginning of the string must satisfy the lookbehind, which is impossible.

In regex1, which is /^(?<=ID: )\d+/, there has to be a beginning of a line that is preceded by "ID: ". The string in question has no such point.
In regex2, which is /\A(?<=ID: )\d+/, there has to be a beginning of a string that is preceded by "ID: ". No string has such a point.
In regex3, which is /(?<=ID: )\d+/, there has to be a point in the string that is preceded by "ID: " and followed by \d+. The string has such a point.

Look-behind doesn't change position of the match.
/(?<=ID: )\d+/ is actually matched at the digit:
ID: 4
    ^
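To make the position concrete, the following small sketch inspects the MatchData for the third regex. The constant name is an assumption made for illustration.

```ruby
LOOKBEHIND_MATCH = 'ID: 4'.match(/(?<=ID: )\d+/)

p LOOKBEHIND_MATCH[0]         # "4"    - the matched text
p LOOKBEHIND_MATCH.begin(0)   # 4      - index where the match starts
p LOOKBEHIND_MATCH.pre_match  # "ID: " - inspected by the lookbehind, not consumed
```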

Related

A way to execute a block only when a unique phrase is input

For example I have two commands in a program:
elsif ##content.downcase.include? "!roll"
randomRoll = rand(100)
puts("You rolled a #{add_comma_to_numbers(randomRoll.to_s)} (1-100)")
elsif ##content.downcase.include? "!roll1m"
randomRoll = rand(1000000)
puts("You rolled a #{add_comma_to_numbers(randomRoll.to_s)} (1-1m)")
Because !roll1m begins with !roll, the !roll branch also matches input meant for !roll1m; I want each command to run only its own block.
Perhaps there is another method I could use besides include??
I have thought about simply flipping the two, but that would only be a temporary fix. Any suggestions would help.
You can do that as follows:
r = /!roll\b/
which reads, "match the string '!roll' followed by a word break".
'!roll'.match?(r)
#=> true
'I am on a !roll today'.match?(r)
#=> true
'!roll1m'.match?(r)
#=> false
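Applied to the two commands from the question, the idea might look like the sketch below. `roll_limit` is a hypothetical helper, not code from the question, and the `i` flag stands in for the original `downcase` call.

```ruby
ROLL    = /!roll\b/i
ROLL_1M = /!roll1m\b/i

# Hypothetical helper: returns the upper bound for the roll,
# or nil when the text contains neither command.
def roll_limit(content)
  case content
  when ROLL_1M then 1_000_000
  when ROLL    then 100
  end
end

p roll_limit('!roll')           # 100
p roll_limit('!Roll1m please')  # 1000000
p roll_limit('hello')           # nil
```

Because of the trailing \b, `ROLL` no longer matches "!roll1m", so the order of the branches stops mattering.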
If, in addition, we wish to match, "tootsie !roll" but not "tootsie!roll", we could use either of the following regular expressions:
r = /(?<=\A|\W)!roll\b/
r = /(?<!\w)!roll\b/
(?<=\A|\W), a positive lookbehind ((?<=...)), requires the string "!roll" to be immediately preceded by the beginning of the string (\A) or (|) by a non-word character (\W).
(?<!\w), a negative lookbehind ((?<!...)), requires the string "!roll" to not be immediately preceded by a word character (\w).
\b requires that '!roll' be followed by a word break, meaning that '!roll' must be followed by a non-word character or be at the end of the string.
Both return the following:
"tootsie !roll".match?(r) #=> true
"tootsie!roll".match?(r) #=> false
I initially expected I could use the regular expression
r = /\b!roll\b/
but as I was reminded in the comments, the word-boundary anchor, \b, requires a word character on one side and no word character on the other. That includes a word character before and nothing after, or a word character after and nothing before (see note 1).
For example,
r = /\b!roll\b/
"!roll".match?(r)
#=> false (\W ('!') after \b but no \w before '!')
"tootsie!roll".match?(r)
#=> true (\w before \b, \W ('!') after \b)
"tootsie !roll".match?(r)
#=> false (\W before and after \b)
1. This is not made clear in the documentation for Regexp (v2.7.0), which states, "\b - Matches word boundaries when outside brackets; backspace (0x08) when inside brackets".
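One way to see where \b can match is to mark every boundary position in a string. The helper below is only an illustrative sketch; `mark_boundaries` is an assumed name.

```ruby
# Inserts "|" at every position where \b matches.
def mark_boundaries(str)
  str.gsub(/\b/, '|')
end

p mark_boundaries('!roll')          # "!|roll|"
p mark_boundaries('tootsie!roll')   # "|tootsie|!|roll|"
p mark_boundaries('tootsie !roll')  # "|tootsie| !|roll|"
```

Only in the second string does a boundary fall directly before the "!", which is why /\b!roll\b/ matches that string alone.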

Regex: Match all hyphens or underscores not at the beginning or the end of the string

I am writing some code that needs to convert a string to camel case. However, I want to allow any _ or - at the beginning of the code.
I have had success matching up an _ character using the regex here:
^(?!_)(\w+)_(\w+)(?<!_)$
when the inputs are:
pro_gamer #matched
#ignored
_proto
proto_
__proto
proto__
__proto__
#matched as nerd_godess_of, skyrim
nerd_godess_of_skyrim
I recursively apply my method on the first match if it looks like nerd_godess_of.
I am having trouble adding - matches to the same regex. I assumed that just adding a - to the mix like this would work:
^(?![_-])(\w+)[_-](\w+)(?<![_-])$
and it matches like this:
super-mario #matched
eslint-path #matched
eslint-global-path #NOT MATCHED.
I would like to understand why the regex fails to match the last case given that it worked correctly for the _.
The (almost) full set of test inputs can be found here
The regex
^(?![_-])(\w+)[_-](\w+)(?<![_-])$
does not match the second hyphen in "eslint-global-path" because the anchors ^ and $ force a single match spanning the whole line, and that match has room for only one hyphen. This regex reads, "Match the beginning of the line, not followed by a hyphen or underscore, then match one or more word characters (including underscores) in a capture group, a hyphen or underscore, and then one or more word characters in a second capture group. Lastly, require that the end of the line not be preceded by a hyphen or underscore."
The fact that an underscore (but not a hyphen) is a word (\w) character completely messes up the regex. In general, rather than using \w, you might want to use \p{Alpha} or \p{Alnum} (or POSIX [[:alpha:]] or [[:alnum:]]).
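The difference is easy to confirm with a short sketch. The constant name `ORIGINAL` is an assumption; the regex itself is the one from the question.

```ruby
ORIGINAL = /^(?![_-])(\w+)[_-](\w+)(?<![_-])$/

# An underscore is a word character; a hyphen is not.
p '_'.match?(/\w/)  # true
p '-'.match?(/\w/)  # false

# \w+ can swallow extra underscores, so several of them still fit the pattern.
p 'nerd_godess_of_skyrim'.match(ORIGINAL)&.captures  # ["nerd_godess_of", "skyrim"]
p 'super-mario'.match(ORIGINAL)&.captures            # ["super", "mario"]

# \w+ cannot cross a hyphen, so a second hyphen has nowhere to go.
p 'eslint-global-path'.match(ORIGINAL)               # nil
```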
Try this.
r = /
(?<= # begin a positive lookbehind
[^_-] # match a character other than an underscore or hyphen
) # end positive lookbehind
( # begin capture group 1
(?: # begin a non-capture group
-+ # match one or more hyphens
| # or
_+ # match one or more underscores
) # end non-capture group
[^_-] # match any character other than an underscore or hyphen
) # end capture group 1
/x # free-spacing regex definition mode
'_cats_have--nine_lives--'.gsub(r) { |s| s[-1].upcase }
#=> "_catsHaveNineLives--"
This regex is conventionally written as follows.
r = /(?<=[^_-])((?:-+|_+)[^_-])/
If all the letters are lower case one could alternatively write
'_cats_have--nine_lives--'.split(/(?<=[^_-])(?:_+|-+)(?=[^_-])/).
map(&:capitalize).join
#=> "_catsHaveNineLives--"
where
'_cats_have--nine_lives--'.split(/(?<=[^_-])(?:_+|-+)(?=[^_-])/)
#=> ["_cats", "have", "nine", "lives--"]
(?=[^_-]) is a positive lookahead that requires the characters on which the split is made to be followed by a character other than an underscore or hyphen
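Wrapped in a method and run against the inputs from the question, the gsub approach might look like this. `camelize` and `SEPARATOR` are hypothetical names, and the capture group is narrowed to the single character that gets upcased.

```ruby
# One or more hyphens (or underscores) that have a non-separator
# character on both sides; group 1 is the character to upcase.
SEPARATOR = /(?<=[^_-])(?:-+|_+)([^_-])/

def camelize(str)
  str.gsub(SEPARATOR) { Regexp.last_match(1).upcase }
end

p camelize('pro_gamer')           # "proGamer"
p camelize('eslint-global-path')  # "eslintGlobalPath"
p camelize('__proto__')           # "__proto__"
```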
you can try the regex
^(?=[^-_])(\w+[-_]\w*)+(?=[^-_])\w$
see the demo here.
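A hyphen that comes first or last inside a character class is literal, so [_-] and [-_] are equivalent; writing it as -_ just makes it obvious that no range, as in a-z, is intended.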

Regex match number of backslashes but not more than x number

I'm trying to match double backslashes in a string, but only when there are exactly 2 and not 3, so I can swap out the 2 for 3.
I know that \\{2} will match a double backslash, except it will also match the first 2 backslashes when 3 are present.
For example in the string
{"files":{"windows": {"%windir%\\\System32\\drivers\\etc\\lmhosts.sam":{"ignore":{"id":32}},"%windir%\\System32\\\drivers\\etc":{"ignore":{"id":32}},"%windir%\\System32\\drivers\\etc\\hosts":{"ignore":{"id":32}}}}}
There are multiple double slashes that I wish to match and replace, but there are also a few triple slashes which I wish to leave alone.
So, my question: how do I match the double slash when it does not sit adjacent to another slash?
Here's a Regex101 link to toy with.
https://regex101.com/r/kWIscW/1
Also, doing this in Ruby.
How about:
\b\\{2}\b
This requires the \\ to be delimited by word boundaries, so it only matches when a word character sits directly on each side of the pair.
Another possibility is to use a lookbehind and a lookahead, although I'm not sure your regex engine supports them:
(?<=[^\\])\\{2}(?=[^\\])
r = /
(?<!\\) # do not match a backslash, negative lookbehind
\\\\ # match two backslashes
(?!\\) # do not match a backslash, negative lookahead
/x # free-spacing regex definition mode
str = "\\\\\ are two backslashes and here are three \\\\\\ of 'em"
puts str
# \\ are two backslashes and here are three \\\ of 'em
str.scan(r)
#=> ["\\\\"]
Note that s = "\\\\\ " is two backslashes followed by an escaped space.
s.size
#=> 3
s[0].ord
#=> 92
92.chr
#=> "\\"
s[1].ord
#=> 92
s[2].ord
#=> 32
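For the replacement the question asks about (turning exactly two backslashes into three), a sketch along these lines may help. `pad_double_backslashes`, `BS` and `EXACTLY_TWO` are hypothetical names; the block form of gsub is used so the replacement text needs no extra escaping.

```ruby
BS = '\\' # a single backslash
EXACTLY_TWO = /(?<!\\)\\\\(?!\\)/

# Replaces every run of exactly two backslashes with three.
def pad_double_backslashes(text)
  text.gsub(EXACTLY_TWO) { BS * 3 }
end

input = "etc#{BS * 2}hosts and System32#{BS * 3}drivers"
puts input                          # two backslashes, then three
puts pad_double_backslashes(input)  # three backslashes in both places
```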
Let's first address backslashes in string literals,
"\\" is one backslash
"\\\\" are two backslashes
"\\\\\\" are three backslashes
Why? Backslash is the escape character in string literals, e.g. "\n" is a line break, and hence a backslash must itself be escaped with a backslash to encode one backslash.
Now, try this
string = "\\\\aaa\\\\bbb\\\\\\ccc"
string.gsub(/\\+/) { |match| match.size == 2 ? '/' : match }
# => "/aaa/bbb\\\\\\ccc"
How does this work?
/\\+/ matches any sequence of backslashes
match.size == 2 filters those that have length 2
And then we just replace those

Regex condition on first and last characters

How can I write a regex to match a string that does not start or end with a white space character? A matching string can have any character in the middle, and importantly, a single-character string should match.
My attempt was:
/\A\S.*\S\z/
but this will not match a single character.
This is one of the cases where you should not attempt to build a regex that matches something, but rather a regex that matches the complement of something, and use the regex negatively.
re = /\A\s|\s\z/
re !~ " " # => false
re !~ "" # => true
re !~ "sss" # => true
re !~ "s ss" # => true
re !~ " s ss" # => false
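The same negated check can be wrapped in a predicate, which also covers the single-character case from the question. The method and constant names here are assumptions for illustration.

```ruby
EDGE_WHITESPACE = /\A\s|\s\z/

# True when the string neither starts nor ends with whitespace.
def no_edge_whitespace?(str)
  !str.match?(EDGE_WHITESPACE)
end

p no_edge_whitespace?('a')     # true
p no_edge_whitespace?('a b')   # true
p no_edge_whitespace?(' a')    # false
p no_edge_whitespace?("a\n")   # false
```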
is_ok = lambda do |str|
a, z = str.chars.first, str.chars.last
"#{a}#{z}" =~ / |\n|\t/ ? false : true
end
#"more elegant" (yeah dude I rock)
is_ok = lambda {|str| [0, -1].map{|i| str.chars[i] }.join =~ / |\n|\t/ ? false : true}
Use this regex:
\A\S+(?:\s*\S+)*\Z
You can play with the Test String part of this demo to see how this works. I'm assuming that strings can span multiple lines, hence the \A and \Z
In Ruby, something like:
if subject =~ /\A\S+(?:\s*\S+)*\Z/
  match = $& # the text of the whole match
end
Explanation
The \A anchor asserts that we are at the beginning of the subject string
\S+ matches one or more non-whitespace characters (whitespace includes spaces, tabs, newlines etc.). Alternatively, if you want to allow newlines at the beginning and only want to exclude a space character, you can use [^ ]+ instead of \S+
(?:\s*\S+) matches any number of optional whitespace characters, followed by one or more non-space characters
The * quantifier repeats that zero or more times
The \Z anchor asserts that we are at the end of the subject string, or just before a newline that ends it (use \z to require the very end)
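One caveat worth checking: in Ruby, \Z also matches just before a trailing newline, whereas \z matches only at the very end. The sketch below compares the two; the constant names are illustrative.

```ruby
WITH_CAPITAL_Z = /\A\S+(?:\s*\S+)*\Z/
WITH_SMALL_Z   = /\A\S+(?:\s*\S+)*\z/

p "a b\n".match?(WITH_CAPITAL_Z)  # true  - the trailing newline is tolerated
p "a b\n".match?(WITH_SMALL_Z)    # false - the string ends with whitespace
p 'a'.match?(WITH_SMALL_Z)        # true  - a single character still matches
```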
Use lookaheads, like this:
\A(?=\S).*\S\Z
This matches the start of the string and requires (1) that the first character be a non-whitespace character and (2) that the last character be a non-whitespace character.
Matches:
a
a b
a b c d 1231 e
Non matches:
(just a space)
a (leading space)
b (trailing space)
empty string

Regex: don't match if string contains whitespace

I can't seem to figure out the regex pattern for matching strings only if it doesn't contain whitespace. For example
"this has whitespace".match(/some_pattern/)
should return nil but
"nowhitespace".match(/some_pattern/)
should return the MatchData with the entire string. Can anyone suggest a solution for the above?
In Ruby I think it would be
/^\S*$/
This means "start, match any number of non-whitespace characters, end"
You could always search for whitespace, and then negate the result:
"str".match(/\s/).nil?
>> "this has whitespace".match(/^\S*$/)
=> nil
>> "nospaces".match(/^\S*$/)
=> #<MatchData "nospaces">
^ = beginning of a line (in Ruby, \A is the beginning of the string)
\S = non-whitespace character, * = 0 or more
$ = end of a line (in Ruby, \z is the end of the string)
Not sure you can do it in one pattern, but you can do something like:
"string".match(/pattern/) unless "string".match(/\s/)
"nowhitespace".match(/^[^\s]*$/)
You want:
/^\S*$/
That says "match the beginning of a line, then zero or more non-whitespace characters, then the end of the line." In Ruby, ^ and $ anchor to lines, so for a single-line string this covers the whole string; use \A and \z to anchor to the whole string regardless of newlines. The convention for pre-defined character classes is that a lowercase letter refers to a class, while an uppercase letter refers to its negation. Thus, \s refers to whitespace characters, while \S refers to non-whitespace.
str.match(/^\S*some_pattern\S*$/)
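Because ^ and $ are line anchors in Ruby, the patterns above can succeed on a single line of a multi-line string. The sketch below contrasts them with \A and \z; `no_whitespace?` is a hypothetical helper name.

```ruby
# True only when the whole string is free of whitespace.
def no_whitespace?(str)
  str.match?(/\A\S*\z/)
end

multi = "has space\nnospace"

p multi[/^\S*$/]                  # "nospace" - matches the second line only
p no_whitespace?(multi)           # false
p no_whitespace?('nowhitespace')  # true
```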
