Understanding negative look aheads in regular expressions - ruby

I want to match urls that do NOT contain the string 'localhost' using Ruby regex
Based on answers and comments here, I put together two solutions, both of which seem to work:
Solution A:
(?!.*localhost)^.*$
Example: http://rubular.com/r/tQtbWacl3g
Solution B:
^((?!localhost).)*$
Example: http://rubular.com/r/2KKnQZUMwf
The problem is that I don't understand what they're doing. For example, according to the docs, ^ can be used in various ways:
[^abc] Any single character except: a, b, or c
^ Start of line
But I don't get how it's being applied here.
Can someone break down these expressions for me, and explain how they differ from one another?

In both of your cases, ^ is just the start of the line (since it's not used inside a character class). Since both ^ and the lookahead are zero-width assertions, we can switch them around in the first case - I think that makes it a bit easier to explain:
^(?!.*localhost).*$
The ^ anchors the expression to the beginning of the string (strictly speaking, in Ruby ^ matches at the beginning of a line). The lookahead then starts from that position and tries to find localhost anywhere in the string (the "anywhere" is taken care of by the .* in front of localhost). If localhost can be found, the subexpression of the lookahead matches, and therefore the negative lookahead causes the pattern to fail. Since the adjacent ^ forces the lookahead to start at the beginning of the string, the pattern as a whole cannot match. If, however, .*localhost does not match (and hence localhost does not occur in the string), the lookahead succeeds, and the .*$ simply takes care of matching the rest of the string.
Now the other one
^((?!localhost).)*$
This time the lookahead only checks at the current position (there is no .* inside it). But the lookahead is repeated for every single character. This way it does check every single position again. Here is roughly what happens: the ^ makes sure that we're starting at the beginning of the string again. The lookahead checks whether the word localhost is found at that position. If not, all is well, and . consumes one character. The * then repeats both of those steps. We are now one character further in the string, and the lookahead checks whether the second character starts the word localhost - again, if not, all is well, and . consumes another character. This is done for every single character in the string, until we reach the end.
In this particular case both methods are equivalent, and you could select one based on performance (if it matters) or readability (if not; probably the first one). However, in other cases the second variant is preferable, because it allows you to do this repetition for a fixed part of the string, whereas the first variant will always check the entire string.
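To see that both patterns behave the same way here, a minimal sketch can run them over a few URLs. The sample URLs and variable names are made up for illustration; only the two patterns come from the discussion above.

```ruby
# Hypothetical sample URLs, chosen only for illustration.
urls = [
  "http://example.com/index.html",
  "http://localhost:3000/users",
  "https://api.example.org/localhost-docs"
]

solution_a = /^(?!.*localhost).*$/   # one lookahead that scans the rest of the line
solution_b = /^((?!localhost).)*$/   # one lookahead in front of every character

# Keep only the URLs that each pattern accepts.
accepted_a = urls.select { |url| url.match?(solution_a) }
accepted_b = urls.select { |url| url.match?(solution_b) }

p accepted_a
p accepted_b
```

Both selections should contain only the first URL, since the other two contain localhost somewhere in the string.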

You can get them easily explained online. The first:
NODE EXPLANATION
--------------------------------------------------------------------------------
(?! look ahead to see if there is not:
--------------------------------------------------------------------------------
.* any character except \n (0 or more times
(matching the most amount possible))
--------------------------------------------------------------------------------
localhost 'localhost'
--------------------------------------------------------------------------------
) end of look-ahead
--------------------------------------------------------------------------------
^ the beginning of the string
--------------------------------------------------------------------------------
.* any character except \n (0 or more times
(matching the most amount possible))
--------------------------------------------------------------------------------
$ before an optional \n, and the end of the
string
--------------------------------------------------------------------------------
And the second:
NODE EXPLANATION
--------------------------------------------------------------------------------
^ the beginning of the string
--------------------------------------------------------------------------------
( group and capture to \1 (0 or more times
(matching the most amount possible)):
--------------------------------------------------------------------------------
(?! look ahead to see if there is not:
--------------------------------------------------------------------------------
localhost 'localhost'
--------------------------------------------------------------------------------
) end of look-ahead
--------------------------------------------------------------------------------
. any character except \n
--------------------------------------------------------------------------------
)* end of \1 (NOTE: because you are using a
quantifier on this capture, only the LAST
repetition of the captured pattern will be
stored in \1)
--------------------------------------------------------------------------------
$ before an optional \n, and the end of the
string
--------------------------------------------------------------------------------

As an aside comment, these two solutions are slow. A better way is to use:
^(?:[^l]+|l(?!ocalhost))+
In other words: any run of characters that are not an l, or an l that is not followed by ocalhost.
This gives a better result since you don't have to check every position. (For a URL like http://localhost:1234/toto, this kind of pattern will fail in ~15 steps vs ~50 steps for the two other patterns.)
You can improve this pattern using atomic groups and possessive quantifiers to forbid backtracks:
^(?>[^l]++|l(?!ocalhost))++
Note that in your particular case you can speed up your pattern considering that you only want to check the host part of the url. Example:
^http:\/\/(?>[^l\s\/]++|l(?!ocalhost))++(?>\/\S*+|$)
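One caveat is worth checking before using the first pattern. As written it has no end anchor, so on a URL that contains localhost it can still succeed by matching only the text in front of it. The sketch below assumes that a trailing $ was intended; the anchored variant and the second URL are illustrative additions, not part of the original answer.

```ruby
unanchored = /^(?:[^l]+|l(?!ocalhost))+/
anchored   = /^(?:[^l]+|l(?!ocalhost))+$/   # "$" added as an assumption

with_localhost    = "http://localhost:1234/toto"
without_localhost = "http://example.com/hello"   # hypothetical URL

# String#[] with a regex returns the matched portion (or nil).
prefix = with_localhost[unanchored]

rejects_localhost = !with_localhost.match?(anchored)
accepts_other     = without_localhost.match?(anchored)

p prefix
p rejects_localhost
p accepts_other
```

With the anchor in place, the pattern has to cover the whole line, so a localhost URL is rejected instead of being matched up to the l.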

according to the docs, ^ can be used in various ways:
[^abc] Any single character except: a, b, or c
^ Start of line
But I don't get how it's being applied here.
In the regex
(?!.*localhost)^.*$
The ^ is not inside any brackets, so the second one applies. Here is a trivial example:
/^x/
That regex says to match the start of the line, followed by the letter x. So it will match lines like this:
xcellent
x-ray
However, the regex will not match the lines:
axb
excellent
...because the x does not appear directly after the start of the line. You may wonder why 'axb' doesn't match. After all 'a' is the start of the line, and it is followed by an 'x'. However, 'start of the line' is just to the left of the first character, like this:
|
V
axb
^ is called a zero-width match because it matches the slim sliver just to the left of the 'a', e.g. between the starting quote mark and the 'a' in "axb". There's not really any space there, so ^ matches something that is 0 width.
Here is another example:
/x^/
That says to match the character x followed by the start of the line. Well, no line can have an x first and then the start of the line second, so that won't ever match anything.
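The examples above can be tried directly. This is a small sketch using the same four words; the variable names are arbitrary.

```ruby
words = %w[xcellent x-ray axb excellent]

# /^x/ needs an x immediately after the start of the line.
starts_with_x = words.grep(/^x/)

# /x^/ asks for the start of a line right after an x, which these words never have.
x_before_start = words.grep(/x^/)

# The anchor on its own matches at index 0 and consumes no characters.
anchor_index = "axb" =~ /^/
anchor_text  = "axb"[/^/]

p starts_with_x
p x_before_start
p anchor_index
p anchor_text
```

The empty string returned for the bare anchor is what "zero-width" means in practice: the match succeeds but contains no characters.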
Now your regex:
(?!.*localhost)^.*$
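Like the 'start of line' ^, a lookahead is zero-width. That means the lookahead scans ahead in the string to test its subpattern, but consumes nothing: whatever it finds, the engine returns to the position where the lookahead started (here, the beginning of the string). Because this lookahead is negative, the match continues only if .*localhost was not found, and the rest of the regex is then applied from that same position: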
^.*$
One word of advice: when a regex requires lookarounds (lookaheads or lookbehinds), 99% of the time there are easier ways to do what you want. For instance, you could write:
url = "....."
if url.index('http') == 0
  # then the line starts with 'http'
else
  # the line doesn't start with 'http'
end
That's much easier to read, and it doesn't require trying to decipher a complex regex.
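Applied to the original question, the same idea might look like the sketch below. The helper name non_local_url? and the sample URLs are hypothetical; String#include? is a standard Ruby method.

```ruby
# Hypothetical helper: true when the URL does not mention localhost at all.
def non_local_url?(url)
  !url.include?("localhost")
end

p non_local_url?("http://example.com/index.html")
p non_local_url?("http://localhost:3000/users")
```

This trades the lookahead for a plain substring test, at the cost of not being usable inside a larger pattern.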

Related

Why does this regex not match numbers and single letters?

Why does this regex not match 3a?
(\/\d{1,4}?|\d{1,4}?|\d{1,4}[A-z]{1})
Using \d{1,4}\D{1}, the result is the same.
Street numbers:
/1
78
3a
89/
-1 (special case)
1
https://regex101.com/r/cYCafR/3
The digits+letter combination is not matched due to the order of alternatives in your pattern. The \d{1,4}? matches the digit before the letter, and \d{1,4}[A-z]{1} does not even have a chance to step in. See the Remember That The Regex Engine Is Eager article.
The \/\d{1,4}? will match a / and a single digit after the slash, and \d{1,4}? will always match a single digit, as {min,max}? is a lazy range/interval/limiting quantifier and as such only matches as few chars as possible. See Laziness Instead of Greediness.
Besides, [A-z] is a typo; it should be [A-Za-z]. The range A-z also covers the ASCII characters that sit between Z and a: [, \, ], ^, _ and `.
It seems you want
\d{1,4}[A-Za-z]|\/?\d{1,4}
See the regex demo. If it should be at the start of a line, use
^(?:\d{1,4}[A-Za-z]|\/?\d{1,4})
See this regex demo.
Details
^ - start of a line
(?: - start of a non-capturing group
\d{1,4}[A-Za-z] - 1 to 4 digits and an ASCII letter
| - or
\/? - an optional /
\d{1,4} - 1 to 4 digits
) - end of the group.
Your regex uses lazy quantifiers like {1,4}?. These will match one character, and stop, because the rest of the pattern (i.e. nothing) matches the rest of the string. See here for how greedy vs lazy quantifiers work.
Another reason is that you put the \d{1,4}[A-z]{1} case last. This case will only be tried if the first two cases don't match. With 3a, the 3 already matches the second case, so the last case won't be considered.
You seem to just want:
^(\d{1,4}[A-Za-z]|\/?\d{1,4})
Note how the \/\d{1,4} case and the \d{1,4} case in your original regex are combined into one case \/?\d{1,4}.
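A quick way to compare the original pattern with the reordered one is to look at what each returns for the sample street numbers. This is a sketch in Ruby; the variable names are arbitrary and the -1 special case is left out.

```ruby
original = /(\/\d{1,4}?|\d{1,4}?|\d{1,4}[A-z]{1})/
fixed    = /^(?:\d{1,4}[A-Za-z]|\/?\d{1,4})/

samples = ["/1", "78", "3a", "89/", "1"]

# String#[] with a regex returns the matched text.
original_on_3a = "3a"[original]
fixed_matches  = samples.map { |s| s[fixed] }

p original_on_3a
p fixed_matches
```

The original stops after the digit because its lazy second alternative wins first; the reordered pattern tries the digits-plus-letter alternative before falling back to plain digits.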

How to understand gsub(/^.*\//, '') or the regex

I am breaking up the code below to check my understanding of the regex and of gsub:
str = "abc/def/ghi.rb"
str = str.gsub(/^.*\//, '')
#str = ghi.rb
^ : beginning of the string
\/ : escape character for /
^.*\/ : everything from beginning to the last occurrence of / in the string
Is my understanding of the expression right?
How does .* work exactly?
Your general understanding is correct. The entire regex will match abc/def/ and String#gsub will replace it with empty string.
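However, note that String#gsub doesn't change the string in place; it returns a new string. Your code works because the result is assigned back to str. Without that assignment, str would still contain the original value ("abc/def/ghi.rb") after the substitution. To change the string in place, you can use String#gsub!.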
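The difference between the two methods can be sketched as follows. The variable names result and copy are only for illustration.

```ruby
str = "abc/def/ghi.rb"

# gsub returns a new string and leaves the receiver untouched.
result = str.gsub(/^.*\//, '')

# gsub! modifies the receiver itself; work on a copy to keep str intact.
copy = str.dup
copy.gsub!(/^.*\//, '')

p str
p result
p copy
```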
As to how .* works - the algorithm the regex engine uses is called backtracking. Since .* is greedy (will try to match as many characters as possible), you can think that something like this will happen:
Step 1: .* matches the entire string abc/def/ghi.rb. Afterwards \/ tries to match a forward slash, but fails (nothing is left to match). .* has to backtrack.
Step 2: .* matches the entire string except the last character - abc/def/ghi.r. Afterwards \/ tries to match a forward slash, but fails (/ != b). .* has to backtrack.
Step 3: .* matches the entire string except the last two characters - abc/def/ghi.. Afterwards \/ tries to match a forward slash, but fails (/ != r). .* has to backtrack.
...
Step n: .* matches abc/def. Afterwards \/ tries to match a forward slash and succeeds. The matching ends here.
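The effect of greediness is easy to observe by comparing .* with its lazy counterpart .*? on the same string. This is a small sketch; the lazy variant is added here only for contrast.

```ruby
path = "abc/def/ghi.rb"

# Greedy: .* takes as much as it can, then backs off to the last slash.
greedy = path[/^.*\//]

# Lazy: .*? takes as little as it can, so it stops at the first slash.
lazy = path[/^.*?\//]

p greedy
p lazy
```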
No, not quite.
^: beginning of a line
\/: escaped slash (escape character is \ alone)
^.*\/ : everything from beginning of a line to the last occurrence of / in the string
.* depends on the mode of the regex. In singleline mode (i.e., without m option), it means the longest possible sequence of zero or more non-newline characters. In multiline mode (i.e., with m option), it means the longest possible sequence of zero or more characters.
Your understanding is correct, but you should also note that the last statement is true because:
Repetition is greedy by default: as many occurrences as possible
are matched while still allowing the overall match to succeed.
Quoted from the Regexp documentation.
Yes. In short, it matches any number of any characters (.*) ending with a literal / (\/).
gsub replaces the match with the second argument (empty string '').
Nothing wrong with your regex, but File.basename(str) might be more appropriate.
To expound on what @Stefen said: it really looks like you're dealing with a file path, and that makes your question an XY problem, where you're asking about Y when you should ask about X. Rather than how to use and understand a regex, the question should be which tool to use to manage paths.
Instead of rolling your own code, use code already written that comes with the language:
str = "abc/def/ghi.rb"
File.basename(str) # => "ghi.rb"
File.dirname(str) # => "abc/def"
File.split(str) # => ["abc/def", "ghi.rb"]
The reason you want to take advantage of File's built-in code is that it takes into account the difference between directory delimiters on *nix-style OSes and Windows. Ruby exposes the delimiter as File::SEPARATOR (and, on platforms such as Windows that accept a second delimiter, as File::ALT_SEPARATOR):
File::SEPARATOR # => "/"
If your code moves from one system to another it will continue working if you use the built-in methods, whereas a regex with a hard-coded delimiter can break when paths use a different one.

Regex to match certain conditions

Basically I want a regex to match these conditions:
First 8 characters should be within [a-zA-Z]
Followed by any number of digits
Followed by any word character but not immediately followed by "or" or "and"
I currently have this regex:
^(?i:([a-z]{1,8})(\d+)((?!or|and).)+)$
this works fine for the following example:
ABCDEFGH1ZZZ
GFEDCBAH99ZZZ99
but NOT with this one, I think because the "OR" in "FORALL" trips the lookahead:
WOLRDWAR2FORALL
Expected output:
AAAAAAAA100NANDROID - should match
AAAAAAAA100ANDROID - should not match
AAAAAAAA100OR - should not match
AAAAAAAA100AND - should not match
Basically I don't want the FOR match the OR, any solution for my problem? btw, this is for Ruby
The problem with @anubhava's regex, and the others like it, is that it is too liberal in using .* after the assertion. That means the engine can split the expression before the assertion and then pick it up on the other side.
For example, ^(?i:([a-z]{8})(\d+)((?!or|and).*))$ easily matches AAAAAAAA100AND: the engine backtracks \d+ by one digit, so the assertion is checked in front of 0AND, where it is satisfied, and .* then consumes the rest.
Usually, if .* were not used, this would not be a concern.
This can be fixed by adding a \d* construct inside the assertion.
Be aware that an assertion is evaluated on its own at the current position. When it fails, nothing prevents the engine from backtracking and trying it again at another position.
^(?i:([a-z]{8})(\d+)((?!\d*(?:or|and)).*))$
Expanded:
^
(?i:
   ( [a-z]{8} )             # (1)
   ( \d+ )                  # (2)
   (                        # (3 start)
      (?!
         \d*
         (?: or | and )
      )
      .*
   )                        # (3 end)
)
$
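The expanded form can be used almost verbatim in Ruby with the x (extended) flag. The sketch below checks it against the strings from the question and, for contrast, runs the looser pattern without \d* on the problematic input. The names strict, loose and verdicts are arbitrary.

```ruby
strict = /
  ^
  (?i:
    ([a-z]{8})             # (1) exactly eight letters
    (\d+)                  # (2) one or more digits
    (                      # (3) the rest of the string
      (?!\d*(?:or|and))    # no or, and here, even after giving back digits
      .*
    )
  )
  $
/x

# The looser variant, without the digit guard inside the assertion.
loose = /^(?i:([a-z]{8})(\d+)((?!or|and).*))$/

samples = %w[
  AAAAAAAA100NANDROID AAAAAAAA100ANDROID AAAAAAAA100OR AAAAAAAA100AND
  WOLRDWAR2FORALL ABCDEFGH1ZZZ GFEDCBAH99ZZZ99
]

verdicts = samples.map { |s| [s, s.match?(strict)] }.to_h
loose_on_and = "AAAAAAAA100AND".match?(loose)

verdicts.each { |s, ok| puts "#{s}: #{ok}" }
puts "loose pattern on AAAAAAAA100AND: #{loose_on_and}"
```

The looser pattern accepts AAAAAAAA100AND because \d+ can give back a digit, which is exactly the backtracking the \d* inside the assertion is meant to block.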
You can tweak your regex as:
/^(?i:([a-z]{8})(\d+)((?!or|and).*))$/
RegEx Demo
I think you are looking for this (I am using a positive look-behind (?<=\d) so that we only exclude or or and that are preceded by a digit):
^(?i:([a-z]{1,8})(\d+)((?!(?<=\d)(?:or|and)).)+)$
See demo
anubhava's answer seems to match the correct values, but all of the previous answers seem to include one or more capture groups, which I didn't see requested in your original post. Here's another possible solution that will match the entire string without groups:
^(?i:[a-z]{8}\d+(?!or|and).*)$
Rubular Demo

Regex matching plus or minus

Could someone please look at the following function and explain the regex to me? I don't understand it, and I don't like using something I don't understand: I won't be able to reproduce it in the future, and I don't learn anything from it.
Also can someone explain the double !! in front, I know single means not so does double mean not "not"?
The function is a patch to String to check if it's capable of being converted to an integer or not.
class String
  def is_i?
    !!(self =~ /\A[-+]?[0-9]+\z/)
  end
end
The main thing that's giving me trouble is [-+] as it makes little sense to me, if you could explain in the context given it would be very helpful.
EDIT:
Since people missed the second part of the question I'll be a little more explicit.
What does !! mean in front of the check? I know a single ! means NOT, but I can't find what !! means.
The [-+] Character Class
[-+] is a character class. It means "match one character specified by the class", i.e. - or +.
Hyphens in Character Classes
I can see how this particular class can be confusing because the hyphen often plays a special role in a character class: it links two characters to form a character range. For instance, [a-z] means "match one character between a and z", and [a-z0-9] means "match one character between a and z or between 0 and 9".
However, in this case, the hyphen in [-+] is positioned in a place where it cannot be used to specify a range, and the - is just a literal hyphen.
Decoding the entire expression
Assert position at the beginning of the string \A
Match a single character from the list “-+” [-+]?
Between zero and one times, as many times as possible, giving back as needed (greedy) ?
Match a single character in the range between “0” and “9” [0-9]+
Between one and unlimited times, as many times as possible, giving back as needed (greedy) +
Assert position at the very end of the string \z
A Character Class defines a set of characters, any one of which can occur in a string for a match to succeed.
For example, the regular expression [-+]?[0-9]+ will match 123, -123, or +123 because it defines a character class (accepting either -, +, or neither one) as its first character.
In context:
\A asserts position at the start of the string.
[-+] any character of: - or + (? optional, meaning between zero and one time)
[0-9] any character of: 0 to 9 (+ quantifier meaning 1 or more times)
\z asserts position at the very end of the string.
What does !! mean?
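!! placed together converts the value to a boolean. =~ returns the index of the match or nil; the first ! turns that into false or true, and the second ! flips it back, so the method returns true for a match and false otherwise.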
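A short sketch makes the conversion visible. The sample strings and variable names are made up for illustration; the pattern is the one from the question.

```ruby
pattern = /\A[-+]?[0-9]+\z/

raw_hit  = "-42" =~ pattern   # index of the match
raw_miss = "4x2" =~ pattern   # nil, because the string is not all digits

# Double negation turns any value into a strict boolean.
bool_hit  = !!raw_hit
bool_miss = !!raw_miss

p raw_hit
p raw_miss
p bool_hit
p bool_miss
```

Note that the index 0 is truthy in Ruby, so a match at the very start of the string still becomes true.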
explain the regex for me as I don't understand it
Pattern explanation: \A[-+]?[0-9]+\z
\A Start of string
[-+]? plus or minus sign [zero or one time (optional)]
[0-9]+ 0 to 9 any digit [one or more times]
\z End of string
The above regex pattern matches any integer, positive or negative, with an optional leading + or - sign.
Read more about Character Classes and test your regex pattern online at Rubular

Regex - Matching text AFTER certain characters

I want to scrape data from some text and dump it into an array. Consider the following text as example data:
| Example Data
| Title: This is a sample title
| Content: This is sample content
| Date: 12/21/2012
I am currently using the following regex to scrape the data that is specified after the 'colon' character:
/((?=:).+)/
Unfortunately this regex also grabs the colon and the space after the colon. How do I only grab the data?
Also, I'm not sure if I'm doing this right.. but it appears as though the outside parens causes a match to return an array. Is this the function of the parens?
EDIT: I'm using Rubular to test out my regex expressions
You could change it to:
/: (.+)/
and grab the contents of group 1. A lookbehind works too, though, and does just what you're asking:
/(?<=: ).+/
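Both suggestions can be tried on the sample data with String#scan. This is a sketch; the heredoc reproduces the example text from the question and the variable names are arbitrary.

```ruby
text = <<~DATA
  | Example Data
  | Title: This is a sample title
  | Content: This is sample content
  | Date: 12/21/2012
DATA

# With a capture group, scan returns one array per match; flatten unwraps them.
via_group = text.scan(/: (.+)/).flatten

# With a lookbehind there is no group, so scan returns the matches directly.
via_lookbehind = text.scan(/(?<=: ).+/)

p via_group
p via_lookbehind
```

This also illustrates the question about parentheses: it is the capture group that makes scan return nested arrays.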
In addition to @minitech's answer, you can also make a 3rd variation:
/(?<=: ?)(.+)/
The difference here being, you create/grab the group using a look-behind.
If you still prefer the look-ahead rather than look-behind concept. . .
/(?=: ?(.+))/
This will place a grouping around your existing regex where it will catch it within a group.
And yes, the outer parentheses in your code create a capture group. Compare that to the latter example I gave, where the group sits inside the look-ahead: wrapping the whole pattern in /( ... )/ without the /(?= ... )/ is needless, since in most regular expression engines the first result is already the entire matched string.
I know you are asking for regex but I just saw the regex solution and found that it is rather hard to read for those unfamiliar with regex.
I'm also using Ruby and I decided to do it with:
line_as_string.split(": ")[-1]
This does what you require and IMHO it's far more readable.
For a very long string it might be inefficient. But not for this purpose.
In Ruby, as in PCRE and Boost, you may make use of the \K match reset operator:
\K keeps the text matched so far out of the overall regex match. h\Kd matches only the second d in adhd.
So, you may use
/:[[:blank:]]*\K.+/ # To only match horizontal whitespaces with `[[:blank:]]`
/:\s*\K.+/ # To match any whitespace with `\s`
See the Rubular demo #1 and the Rubular demo #2.
Details
: - a colon
[[:blank:]]* - 0 or more horizontal whitespace chars
\K - match reset operator discarding the text matched so far from the overall match memory buffer
.+ - matches and consumes any 1 or more chars other than line break chars (use /m modifier to match any chars including line break chars).
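The \K variant can be checked with String#scan as well. This is a minimal sketch on the sample data from the question; the variable names are arbitrary.

```ruby
text = <<~DATA
  | Example Data
  | Title: This is a sample title
  | Content: This is sample content
  | Date: 12/21/2012
DATA

# Everything before \K is matched but dropped from the reported match.
values = text.scan(/:[[:blank:]]*\K.+/)

p values
```

Because the colon and the blanks are matched before \K, no capture group or lookbehind is needed to keep them out of the result.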

Resources