Ruby Regex optimization - ruby

Regex newbie here. I have a regular expression that matches Windows pathnames and UNC paths, terminated by '\'.
Working examples:
c:\windows\
c:\
\\server\share\
\\server\sh are\
Invalid:
c:\windows
\\server
\\server\share
\\server\ share \
It works as expected (at least I hope so), but it's pretty unreadable and not very performant, so any tips for optimization are greatly appreciated.
/\A(
([a-z]:\\(([a-zA-Z0-9äöüÄÖÜß_.$]+|[a-zA-Z0-9äöüÄÖÜß_.$]+[a-zA-Z0-9äöüÄÖÜß_.$\ ]*[a-zA-Z0-9äöüÄÖÜß_.$]+)\\)*)|
(\\\\(([a-zA-Z0-9äöüÄÖÜß_.$]+|[a-zA-Z0-9äöüÄÖÜß_.$]+[a-zA-Z0-9äöüÄÖÜß_.$\ ]*[a-zA-Z0-9äöüÄÖÜß_.$]+)\\)+(([a-zA-Z0-9äöüÄÖÜß_.$]+|
[a-zA-Z0-9äöüÄÖÜß_.$]+[a-zA-Z0-9äöüÄÖÜß_.$\ ]*[a-zA-Z0-9äöüÄÖÜß_.$]+)\\)+)
)\z/x

In Ruby 1.9, the following should work:
if subject =~
  /\A(?:(?!.*\\(?:con|prn|aux|nul|com\d|lpt\d)\\) # exclude invalid names
    (?:                               # Either match
      [a-z]:\\                        # drive letter
    |                                 # or
      \\\\(?:[^\\\/:*?"<>|\s]+\\){2}  # UNC share name
    )                                 # End of alternation
    (?:                               # Try to match:
      (?!\s)                          # (Assert no starting space)
      [^\\\/:*?"<>|\r\n]+             # a valid directory name
      (?<!\s)                         # (Assert no ending space)
      \\                              # backslash
    )*                                # repeat as needed
  )\Z/mix
  # Successful match
else
  # Match attempt failed
end
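A quick way to check the pattern is to run it against the sample paths from the question. This is only a sketch: the constant `WINDOWS_DIR` and the helper `windows_dir?` are names made up for illustration, and the regex is the same one as above, compacted onto fewer lines.

```ruby
# Hypothetical constant holding the pattern from the answer (free-spacing mode).
WINDOWS_DIR = /\A(?:(?!.*\\(?:con|prn|aux|nul|com\d|lpt\d)\\)
  (?: [a-z]:\\ | \\\\(?:[^\\\/:*?"<>|\s]+\\){2} )
  (?: (?!\s) [^\\\/:*?"<>|\r\n]+ (?<!\s) \\ )*
)\Z/mix

# Hypothetical helper: true if the path is a backslash-terminated directory.
def windows_dir?(path)
  WINDOWS_DIR.match?(path)
end

# In single-quoted strings, \\ is one backslash; \w, \s etc. stay literal.
samples = [
  'c:\windows\\',        # valid
  'c:\\',                # valid
  '\\\\server\share\\',  # valid
  'c:\windows',          # invalid: no trailing backslash
  '\\\\server',          # invalid: no share
  '\\\\server\share',    # invalid: no trailing backslash
  'c:\con\\'             # invalid: reserved device name
]

samples.each { |path| puts "#{path.ljust(18)} #{windows_dir?(path)}" }
```

Note that the UNC branch excludes whitespace (`\s`) from server and share names, so `\\server\sh are\` from the question is rejected by this pattern; the share-name class would need to be loosened if that path must be accepted.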

Related

Detect opening and closing brackets

From a string:
"(book(1:3000))"
I need to exclude opening and closing brackets and match:
"book(1:3000)"
using regular expression.
I tried this regular expression:
/([^',()]+)|'([^']*)'/
which detects all characters and integers excluding brackets. The string detected by this regex is:
"book 1:3000"
Is there any regex that disregards the opening and closing brackets, and gives the entire string?
Build the regexp that explicitly states exactly what you want to extract: alphanumerics, followed by the opening parenthesis, followed by digits, followed by a colon, followed by digits, followed by closing parenthesis:
'(book(1:3000))'[/\w+\(\d+:\d+\)/]
#⇒ "book(1:3000)"
"(book(1:3000))"[/^\(?(.+?\))\)?/, 1]
=> "book(1:3000)"
"book(1:3000)"[/^\(?(.+?\))\)?/, 1]
=> "book(1:3000)"
The regex split on multiple lines for easier reading:
/
^     # start of string
\(?   # an optional opening parenthesis
(     # start capturing
.+?   # any characters, as few as possible, until..
\)    # ..the first closing parenthesis
)     # stop capturing
\)?   # an optional closing parenthesis
/x    # end regex with /x modifier (allows splitting to lines)
1. Look for a possible ( in the beginning of string and ignore it.
2. Start capturing
3. Capture until and including the first )
But this is where it fails:
"book(1:(10*30))"[/^\(?(.+?\))\)?/, 1]
=> "book(1:(10*30)"
If you need to handle nested parentheses like that, you probably need a recursive regex, as described in another Stack Overflow answer.
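As a rough sketch of what such a recursive regex could look like in Ruby: a named group can call itself with `\g<name>`, which keeps nested parentheses balanced. The constant `BALANCED_CALL`, the group name `paren`, and the helper `extract_call` are invented for this example, and the pattern assumes the text of interest is a word followed by one parenthesized group.

```ruby
# (?<paren>...) defines a group; \g<paren> calls it recursively,
# so every "(" inside must be closed by its own ")".
BALANCED_CALL = /\w+(?<paren>\((?:[^()]|\g<paren>)*\))/

# Hypothetical helper: returns the first "word(...)" with balanced parentheses.
def extract_call(text)
  text[BALANCED_CALL]
end

puts extract_call("(book(1:3000))")     # book(1:3000)
puts extract_call("book(1:(10*30))")    # book(1:(10*30))
puts extract_call("(book(1:(10*30)))")  # book(1:(10*30))
```

Unlike the lazy `.+?\)` approach, this does not stop at the first closing parenthesis, so the nested `(10*30)` is kept whole.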

Why 'scan' reads multiple lines

My test configuration file(test_config.conf) looks as below
[DEFAULT]
system_name=
#test
flag=true
I want to read this and scan the value for key "system_name", with the expected output nil. I could have used config parser to read the contents, but using scan is my requirement.
I did:
File.read
Scan: file_data.scan(/^#{each}\s*=\s*(?!.*#)\s*(.*)/)
Regex: ^system_name\s*=\s*(?!.*#)\s*(.*)$
I used (?!.*#) to ignore the values that start with #.
It returns #test. Could someone help me understand why it does so, and how I can change my regex to make it work as expected?
This is another case of backtracking confusing regex users. The (?!.*#) negative lookahead only has to succeed at some position the engine can reach. Since the preceding \s* can match in several ways, the regex engine retries it with fewer characters once the lookahead fails. In your case, \s* matches 0 or more whitespace characters, and that includes the line break after =. Once the engine has matched all the whitespace after =, the lookahead finds # on the next line and fails. The engine then backtracks and lets \s* match zero whitespace characters. At that position the rest of the line contains no # (. does not match a line break), so the lookahead succeeds, and the second \s* then consumes the line break so that (.*) captures #test.
Use a possessive quantifier with \s*+ to disallow backtracking:
^system_name\s*=\s*+(?!#)(.*)$
                   ^
See the Rubular demo. The lookahead now runs only once, after all 0+ whitespace characters have been matched. If it fails, the match attempt at that position fails right away.
Another way is to use a [^\s#] negated character class:
^system_name\s*=\s*([^\s#].*)$
                   ^^^^^^^
See another Rubular demo
Here, [^\s#] matches exactly one char that is neither whitespace nor #, and then .* matches any 0+ chars other than line break chars.
As per the feedback inside comments, the structure of the input may be rather loose, and a key=value can follow the system_name line. In that case, you also need to make sure the text you capture does not actually start with some word chars followed with = sign:
/^system_name\s*=\s*+(?!#|\w+=)(.*)$/
See this Rubular demo
Full pattern details:
^ - start of a line
system_name - a literal substring
\s* - 0 or more whitespaces
= - an equal sign
\s*+ - 0 or more whitespaces with no backtracking into the pattern due to *+ possessive quantifier
(?!#|\w+=) - a negative lookahead that fails the match if the # or 1+ word chars and then = are found immediately to the right of the current location (that is right after the 0+ whitespaces)
(.*) - Group 1: any 0+ chars up to the end of the line
$ - end of a line.
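The difference between the two patterns can be reproduced with a short script. This is a sketch under a few assumptions: the file contents are inlined as a string instead of being read with File.read, and the variable `key` is a hypothetical stand-in for the interpolated `each` from the question.

```ruby
# Inlined copy of the sample configuration file from the question.
config = "[DEFAULT]\nsystem_name=\n#test\nflag=true\n"
key = "system_name" # hypothetical variable name

# Pattern from the question: \s* can backtrack, so the next line leaks in.
original = /^#{Regexp.escape(key)}\s*=\s*(?!.*#)\s*(.*)/
# Pattern from the answer: possessive \s*+ cannot give whitespace back.
fixed = /^#{Regexp.escape(key)}\s*=\s*+(?!#|\w+=)(.*)$/

p config.scan(original)               # [["#test"]]
p config.scan(fixed)                  # []
p "system_name = prod\n".scan(fixed)  # [["prod"]]
```

scan returns an empty array rather than nil when nothing matches, so something like `config.scan(fixed).flatten.first` would be needed to get nil for the empty value.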

Regex - Validate Email Domain and Full email

Here is the regex that I have:
\Ame\..*$
And I want it to match on:
me.com
me.ca
Bill@me.com
Bill.Smith@me.com
It also must not match on:
me.you@mean.com
me.you@foo
Currently it only matches the domain and not the full email.
I am using ruby for this.
I have been using http://rubular.com/ to try and solve this.
The following works, if I understand your requirements correctly:
\bme\.[^.@]*\z
Explanation:
\b     # Match a word boundary (here, the start of a word)
me     # Match "me"
\.     # Match "."
[^.@]* # Match any string unless it contains a "." or an "@"
\z     # Match the end of the string
(I used \z here instead of the $ from my Rubular example, because $ also matches at the end of each line.)
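A small check of the pattern against the sample inputs, assuming the addresses use the usual `@` separator. The constant name `ME_DOMAIN` is made up for this sketch.

```ruby
# Hypothetical constant holding the pattern from the answer.
ME_DOMAIN = /\bme\.[^.@]*\z/

inputs = %w[
  me.com me.ca Bill@me.com Bill.Smith@me.com
  me.you@mean.com me.you@foo
]

# The first four print true, the last two print false.
inputs.each { |input| puts "#{input.ljust(18)} #{ME_DOMAIN.match?(input)}" }
```

The last two inputs fail because after "me." the class [^.@]* stops at the "@", so \z cannot be reached, and the "me" inside "mean" is not followed by a ".".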

Ruby regex too greedy with back to back matches

I'm working on some text processing in Ruby 1.8.7 to support some custom shortcodes that I've created. Here are some examples of my shortcode:
[CODE first-part]
[CODE first-part second-part]
I'm using the following regex to grab the shortcode parts:
text.gsub!( /\[CODE (\S+)\s?(\S?)\]/i, replacementText )
The problem is this: the regex doesn't work on the following text:
[CODE first-part][CODE first-part-again]
The results are as follows:
1. first-part][CODE
2. first-part-again
It seems that the \s? is the problematic part of the regex that is searching on until it hits the last space, not the first one. When I change the regex to the following:
/\[CODE ([\w-]+)\s?(\S*)\]/i
It works fine. The only concern I have is what exactly \w matches compared to \S, as I want to make sure that \w will match URL-safe characters.
I'm sure there's a perfectly valid explanation, but it's eluding me. Any ideas? Thanks!
Actually, thinking about it, just using [^\]] might not be enough, as it will swallow up all spaces as well. You also need to exclude those:
/\[CODE[ ]([^\]\s]+)\s?([^\]\s]*)\]/i
Note the [ ] - I just think it makes literal spaces more readable.
Working demo.
Explained in free-spacing mode:
\[CODE[ ] # match your identifier
( # capturing group 1
[^\]\s]+ # match one or more non-], non-whitespace characters
) # end of group 1
\s? # match an optional whitespace character
( # capturing group 2
[^\]\s]* # match zero or more non-], non-whitespace characters
) # end of group 2
\] # match the closing ]
As none of the character classes in the pattern includes ], you can never possibly go beyond the end of the square bracketed expression.
By the way, if you find unnecessary escapes in regex as obscuring as I do, here is the minimal version:
/\[CODE[ ]([^]\s]+)\s?([^]\s]*)]/i
But that is definitely a matter of taste.
The problem was with the greedy \S+ in this
/\[CODE (\S+)\s?(\S?)\]/i
You could try:
/\[CODE (\S+?)\s?(\S?)\]/i
but actually your new character class is IMO superior.
Even better might be:
/\[CODE ([^\]]+?)\s?([^\]]*)\]/i
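To see the difference in practice, here is a small sketch that runs a greedy pattern and the negated-class pattern over the back-to-back shortcodes. One assumption to flag: the greedy variant below uses `(\S*)` for the second group, chosen because it reproduces the output reported in the question; the variable names are invented for the example.

```ruby
text = "[CODE first-part][CODE first-part-again]"

# Assumed greedy variant: \S+ happily runs across "][CODE".
greedy = /\[CODE (\S+)\s?(\S*)\]/i
# Negated classes can never cross a "]" or a whitespace character.
fixed = /\[CODE[ ]([^\]\s]+)\s?([^\]\s]*)\]/i

p text.scan(greedy)  # [["first-part][CODE", "first-part-again"]]
p text.scan(fixed)   # [["first-part", ""], ["first-part-again", ""]]
p "[CODE first-part second-part]".scan(fixed)  # [["first-part", "second-part"]]
```

With the negated classes the second group is simply an empty string when the shortcode has only one part.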

How to modify this regex to exclude punctuation in a URL?

I've modified a regex that I found here so that it would accept various UK and second-level TLDs.
/\b((?:^https?:\/\/|^[a-z0-9.\-]+[.][a-z]{2,4})(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!#()\[\]{};:'".,<>?]))/i
However as you can see in my test data here, the regex matches URLs such as www.zapple.#com and https://m!crosoft.com which are not valid.
For some reason # symbols are excluded before the .com but after the . they are not.
Exclamation marks are not excluded at all which is confusing since, as far as I can see, only letters, numbers and dashes are allowed before the period.
The # is matched by
[^\s()<>]+
And the ! mark by
(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+
I'm not sure, but that doesn't look like a good regex for matching URLs.
Try the following, which matches a URL according to RFC 3986.
Both absolute and relative URLs are supported.
Set case insensitivity to true.
^
(# Scheme
  [a-z][a-z0-9+\-.]*:
  (# Authority & path
    //
    ([a-z0-9\-._~%!$&'()*+,;=]+@)?              # User
    ([a-z0-9\-._~%]+                            # Named host
    |\[[a-f0-9:.]+\]                            # IPv6 host
    |\[v[a-f0-9][a-z0-9\-._~%!$&'()*+,;=:]+\])  # IPvFuture host
    (:[0-9]+)?                                  # Port
    (/[a-z0-9\-._~%!$&'()*+,;=:@]+)*/?          # Path
  |# Path without authority
    (/?[a-z0-9\-._~%!$&'()*+,;=:@]+(/[a-z0-9\-._~%!$&'()*+,;=:@]+)*/?)?
  )
|# Relative URL (no scheme or authority)
  ([a-z0-9\-._~%!$&'()*+,;=@]+(/[a-z0-9\-._~%!$&'()*+,;=:@]+)*/?  # Relative path
  |(/[a-z0-9\-._~%!$&'()*+,;=:@]+)+/?)                            # Absolute path
)
# Query
(\?[a-z0-9\-._~%!$&'()*+,;=:@/?]*)?
# Fragment
(\#[a-z0-9\-._~%!$&'()*+,;=:@/?]*)?
$
Update 1
This does not match m!crosoft.com and #pple.com. It's probably due to something with Rubular.
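For use from Ruby, the pattern can be written as a free-spacing literal with the x and i flags. The following is a sketch under stated assumptions: `@` is taken as the user-info delimiter and as an allowed path character, as in RFC 3986, and the constant name `URL_PATTERN` is invented. `%r{...}` is used so that the slashes need no escaping.

```ruby
# Hypothetical constant; x = free-spacing with comments, i = case-insensitive.
URL_PATTERN = %r{
  ^
  (# Scheme
    [a-z][a-z0-9+\-.]*:
    (# Authority & path
      //
      ([a-z0-9\-._~%!$&'()*+,;=]+@)?              # User
      ([a-z0-9\-._~%]+                            # Named host
      |\[[a-f0-9:.]+\]                            # IPv6 host
      |\[v[a-f0-9][a-z0-9\-._~%!$&'()*+,;=:]+\])  # IPvFuture host
      (:[0-9]+)?                                  # Port
      (/[a-z0-9\-._~%!$&'()*+,;=:@]+)*/?          # Path
    |# Path without authority
      (/?[a-z0-9\-._~%!$&'()*+,;=:@]+(/[a-z0-9\-._~%!$&'()*+,;=:@]+)*/?)?
    )
  |# Relative URL (no scheme or authority)
    ([a-z0-9\-._~%!$&'()*+,;=@]+(/[a-z0-9\-._~%!$&'()*+,;=:@]+)*/?  # Relative path
    |(/[a-z0-9\-._~%!$&'()*+,;=:@]+)+/?)                            # Absolute path
  )
  # Query
  (\?[a-z0-9\-._~%!$&'()*+,;=:@/?]*)?
  # Fragment
  (\#[a-z0-9\-._~%!$&'()*+,;=:@/?]*)?
  $
}xi

puts URL_PATTERN.match?("http://www.example.com/path?q=1#frag")  # true
puts URL_PATTERN.match?("../relative/path")                      # true
puts URL_PATTERN.match?("http://exa mple.com")                   # false
```

In Ruby, ^ and $ match at line boundaries, so \A and \z would be the stricter choice when validating a whole string.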
