Regex newbie here. I have a regular expression that matches Windows pathnames and UNC paths, terminated by '\'.
Working examples:
c:\windows\
c:\
\\server\share\
\\server\sh are\
Invalid:
c:\windows
\\server
\\server\share
\\server\ share \
It works as expected (at least I hope so), but it's pretty unreadable and not very performant, so any tips for optimization are greatly appreciated...
/\A(
([a-z]:\\(([a-zA-Z0-9äöüÄÖÜß_.$]+|[a-zA-Z0-9äöüÄÖÜß_.$]+[a-zA-Z0-9äöüÄÖÜß_.$\ ]*[a-zA-Z0-9äöüÄÖÜß_.$]+)\\)*)|
(\\\\(([a-zA-Z0-9äöüÄÖÜß_.$]+|[a-zA-Z0-9äöüÄÖÜß_.$]+[a-zA-Z0-9äöüÄÖÜß_.$\ ]*[a-zA-Z0-9äöüÄÖÜß_.$]+)\\)+(([a-zA-Z0-9äöüÄÖÜß_.$]+|
[a-zA-Z0-9äöüÄÖÜß_.$]+[a-zA-Z0-9äöüÄÖÜß_.$\ ]*[a-zA-Z0-9äöüÄÖÜß_.$]+)\\)+)
)\z/
In Ruby 1.9, the following should work:
if subject =~
/\A(?:(?!.*\\(?:con|prn|aux|nul|com\d|lpt\d)\\) # exclude invalid names
(?: # Either match
[a-z]:\\ # drive letter
| # or
\\\\(?:[^\\\/:*?"<>|\s]+\\){2} # UNC share name
) # End of alternation
(?: # Try to match:
(?!\s) # (Assert no starting space)
[^\\\/:*?"<>|\r\n]+ # a valid directory name
(?<!\s) # (Assert no ending space)
\\ # backslash
)* # repeat as needed
)\Z/mix
# Successful match
else
# Match attempt failed
end
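As a quick sanity check, the pattern above can be run against some of the sample paths from the question. This is only a sketch: the constant name WINDOWS_DIR and the sample list are made up for illustration. Note that the UNC part of this pattern excludes whitespace, so a server or share name containing a space is rejected.

```ruby
# Hypothetical constant name; the pattern is the one from the answer above.
WINDOWS_DIR = /\A(?:(?!.*\\(?:con|prn|aux|nul|com\d|lpt\d)\\) # exclude reserved names
  (?:
    [a-z]:\\                          # drive letter
  |
    \\\\(?:[^\\\/:*?"<>|\s]+\\){2}    # UNC server and share
  )
  (?:
    (?!\s)                            # no leading space
    [^\\\/:*?"<>|\r\n]+               # a directory name
    (?<!\s)                           # no trailing space
    \\                                # backslash
  )*
)\Z/mix

# In single-quoted Ruby strings, \\ stands for one literal backslash.
samples = [
  'c:\\windows\\',
  'c:\\',
  '\\\\server\\share\\',
  'c:\\windows',
  '\\\\server',
  '\\\\server\\share',
  'c:\\con\\',
]

samples.each do |path|
  verdict = path =~ WINDOWS_DIR ? 'match' : 'no match'
  puts "#{path}: #{verdict}"
end
```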
From a string:
"(book(1:3000))"
I need to exclude opening and closing brackets and match:
"book(1:3000)"
using regular expression.
I tried this regular expression:
/([^',()]+)|'([^']*)'/
which detects all characters and integers excluding brackets. The string detected by this regex is:
"book 1:3000"
Is there any regex that disregards the opening and closing brackets, and gives the entire string?
Build the regexp that explicitly states exactly what you want to extract: alphanumerics, followed by the opening parenthesis, followed by digits, followed by a colon, followed by digits, followed by closing parenthesis:
'(book(1:3000))'[/\w+\(\d+:\d+\)/]
#⇒ "book(1:3000)"
"(book(1:3000))"[/^\(?(.+?\))\)?/, 1]
=> "book(1:3000)"
"book(1:3000)"[/^\(?(.+?\))\)?/, 1]
=> "book(1:3000)"
The regex split on multiple lines for easier reading:
/
^ # start of string
\(? # an optional opening bracket
( # start capturing
.+? # any characters forward until..
\) # ..a closing bracket
) # stop capturing
\)? # an optional closing bracket
/x # end regex with /x modifier (allows splitting to lines)
1. Look for a possible ( in the beginning of string and ignore it.
2. Start capturing
3. Capture until and including the first )
But this is where it fails:
"book(1:(10*30))"[/^\(?(.+?\))\)?/, 1]
=> "book(1:(10*30)"
If you need something like that, you probably need to use a recursive regex, as
described in another Stack Overflow answer.
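Ruby's regex engine (1.9 and later) supports subexpression calls with \g<name>, which let a group call itself. A minimal sketch of that idea for this case, assuming the text to extract is always a word followed by a balanced parenthesized part (the group name paren and the constant name are made up here):

```ruby
# Hypothetical pattern: a word followed by a balanced (...) group.
# (?<paren>...) names the group; \g<paren> calls it recursively for nested brackets.
BALANCED_CALL = /\w+(?<paren>\((?:[^()]|\g<paren>)*\))/

p "(book(1:3000))"[BALANCED_CALL]     # => "book(1:3000)"
p "book(1:(10*30))"[BALANCED_CALL]    # => "book(1:(10*30))"
p "(book(1:(10*30)))"[BALANCED_CALL]  # => "book(1:(10*30))"
```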
My test configuration file(test_config.conf) looks as below
[DEFAULT]
system_name=
#test
flag=true
I want to read this and scan the value for key "system_name", with the expected output nil. I could have used config parser to read the contents, but using scan is my requirement.
I did:
File.read
Scan: file_data.scan(/^#{each}\s*=\s*(?!.*#)\s*(.*)/)
Regex: ^system_name\s*=\s*(?!.*#)\s*(.*)$
I used (?!.*#) to ignore the values that start with #.
It returns #test. Could someone help me understand why it does so, and how I can change my regex to make it work as expected?
This is another case of backtracking confusing regex users. The (?!.*#) negative lookahead succeeds only at a location where no # follows later on the same line. Since the preceding pattern part can match the string in various ways, the regex engine retries the quantified subpatterns once the lookahead fails. In your case, \s* matches 0 or more whitespace characters, including line breaks. Once the regex engine has matched all the whitespace after =, it finds # and fails. Then it backtracks and tries matching zero whitespace characters, finds that there is no # on the rest of the system_name line, and succeeds. The second \s* then consumes the line break, and (.*) captures #test.
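The backtracking can be reproduced with the sample file from the question. A small sketch, with the file contents inlined as a heredoc (Ruby 2.3+ for the squiggly heredoc) instead of read from disk:

```ruby
# Contents of test_config.conf from the question, inlined for the demonstration.
config = <<~CONF
  [DEFAULT]
  system_name=
  #test
  flag=true
CONF

# The original pattern: the first \s* consumes the newline after "=", the
# lookahead then sees "#test" and fails, so \s* gives the newline back. At that
# position the rest of the line is empty, the lookahead passes, the second \s*
# eats the newline, and (.*) captures the next line.
original = /^system_name\s*=\s*(?!.*#)\s*(.*)/

p config.scan(original)  # => [["#test"]]
```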
Use a possessive quantifier with \s*+ to disallow backtracking:
^system_name\s*=\s*+(?!#)(.*)$
                   ^
See the Rubular demo. This way, the lookahead runs only once, after all the 0+ whitespace characters are matched. If it fails, the whole match fails right away.
Another way is to use [^\s#] negated character class:
^system_name\s*=\s*([^\s#].*)$
                   ^^^^^^^
See another Rubular demo
Here, [^\s#] will only match a char that is not a whitespace, nor #, and then .* will match any 0+ chars other than line break chars.
As per the feedback inside comments, the structure of the input may be rather loose, and a key=value can follow the system_name line. In that case, you also need to make sure the text you capture does not actually start with some word chars followed with = sign:
/^system_name\s*=\s*+(?!#|\w+=)(.*)$/
See this Rubular demo
Full pattern details:
^ - start of a line
system_name - a literal substring
\s* - 0 or more whitespaces
= - an equal sign
\s*+ - 0 or more whitespaces with no backtracking into the pattern due to *+ possessive quantifier
(?!#|\w+=) - a negative lookahead that fails the match if the # or 1+ word chars and then = are found immediately to the right of the current location (that is right after the 0+ whitespaces)
(.*) - Group 1: any 0+ chars up to the end of the line
$ - end of a line.
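Putting it together, here is a minimal sketch of a lookup helper built on that pattern. The method name value_for is made up for illustration, and the file contents are inlined rather than read with File.read:

```ruby
# Hypothetical helper: returns the value for +key+, or nil when the value is
# empty (including when the next line is a comment or another key=value pair).
def value_for(file_data, key)
  pattern = /^#{Regexp.escape(key)}\s*=\s*+(?!#|\w+=)(.*)$/
  value = file_data.scan(pattern).flatten.first
  value.nil? || value.empty? ? nil : value
end

file_data = "[DEFAULT]\nsystem_name=\n#test\nflag=true\n"

p value_for(file_data, "system_name")  # => nil
p value_for(file_data, "flag")         # => "true"
```

Because \s*+ is possessive, the engine cannot hand the newline back to make the lookahead pass, so scan finds nothing for system_name.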
Here is the regex that I have:
\Ame\..*$
And I want it to match on:
me.com
me.ca
Bill#me.com
Bill.Smith#me.com
It also must not match on:
me.you#mean.com
me.you#foo
Currently it only matches the domain and not the full email.
I am using ruby for this.
I have been using http://rubular.com/ to try and solve this.
The following works, if I understand your requirements correctly:
\bme\.[^.#]*\z
Explanation:
\b # Match the start of a word
me # Match "me"
\. # Match "."
[^.#]* # Match any string unless it contains a "." or a "#"
\z # Match the end of the string
(I used \z instead of $, which I had used in the Rubular example, because $ also matches at the end of a line.)
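A short sketch that runs the pattern over the samples from the question (the constant and variable names are made up here). Note that the match itself only covers the text from "me." onward, which is enough for a yes/no test:

```ruby
# Hypothetical constant name; the pattern is the one given above.
ME_DOMAIN = /\bme\.[^.#]*\z/

should_match     = ['me.com', 'me.ca', 'Bill#me.com', 'Bill.Smith#me.com']
should_not_match = ['me.you#mean.com', 'me.you#foo']

(should_match + should_not_match).each do |s|
  puts "#{s}: #{s =~ ME_DOMAIN ? 'match' : 'no match'}"
end
```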
I'm working on some text processing in Ruby 1.8.7 to support some custom shortcodes that I've created. Here are some examples of my shortcode:
[CODE first-part]
[CODE first-part second-part]
I'm using the following regex to grab the parts:
text.gsub!( /\[CODE (\S+)\s?(\S?)\]/i, replacementText )
The problem is this: the regex doesn't work on the following text:
[CODE first-part][CODE first-part-again]
The results are as follows:
1. first-part][CODE
2. first-part-again
It seems that the \s? is the problematic part of the regex that is searching on until it hits the last space, not the first one. When I change the regex to the following:
/\[CODE ([\w-]+)\s?(\S*)\]/i
It works fine. The only concern I have is how \w differs from \S, as I want to make sure \w will match URL-safe characters.
I'm sure there's a perfectly valid explanation, but it's eluding me. Any ideas? Thanks!
Actually, thinking about it, just using [^\]] might not be enough, as it will swallow up all spaces as well. You also need to exclude those:
/\[CODE[ ]([^\]\s]+)\s?([^\]\s]*)\]/i
Note the [ ] - I just think it makes literal spaces more readable.
Working demo.
Explained in free-spacing mode:
\[CODE[ ] # match your identifier
( # capturing group 1
[^\]\s]+ # match one or more non-], non-whitespace characters
) # end of group 1
\s? # match an optional whitespace character
( # capturing group 2
[^\]\s]* # match zero or more non-], non-whitespace characters
) # end of group 2
\] # match the closing ]
As none of the character classes in the pattern includes ], you can never possibly go beyond the end of the square bracketed expression.
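A brief sketch of the pattern applied with scan to the inputs from the question (the constant name SHORTCODE is made up for illustration):

```ruby
# Hypothetical constant name; the pattern is the one given above.
SHORTCODE = /\[CODE[ ]([^\]\s]+)\s?([^\]\s]*)\]/i

p "[CODE first-part][CODE first-part-again]".scan(SHORTCODE)
# => [["first-part", ""], ["first-part-again", ""]]

p "[CODE first-part second-part]".scan(SHORTCODE)
# => [["first-part", "second-part"]]
```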
By the way, if you find unnecessary escapes in regex as obscuring as I do, here is the minimal version:
/\[CODE[ ]([^]\s]+)\s?([^]\s]*)]/i
But that is definitely a matter of taste.
The problem was with the greedy \S+ in this
/\[CODE (\S+)\s?(\S?)\]/i
You could try:
/\[CODE (\S+?)\s?(\S?)\]/i
but actually your new character class is IMO superior.
Even better might be:
/\[CODE ([^\]]+?)\s?([^\]]*)\]/i
I've modified a regex that I found here so that it would accept various UK and second-level TLDs.
/\b((?:^https?:\/\/|^[a-z0-9.\-]+[.][a-z]{2,4})(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!#()\[\]{};:'".,<>?]))/i
However as you can see in my test data here, the regex matches URLs such as www.zapple.#com and https://m!crosoft.com which are not valid.
For some reason # symbols are excluded before the .com but after the . they are not.
Exclamation marks are not excluded at all which is confusing since, as far as I can see, only letters, numbers and dashes are allowed before the period.
The # is matched by
[^\s()<>]+
And the ! mark by
(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+
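A quick check of that negated class on its own, as a sketch, shows why both characters slip through: [^\s()<>] accepts anything except whitespace, parentheses, and angle brackets. The variable name loose is made up here.

```ruby
# The negated class from the pattern in the question, anchored for the test.
loose = /\A[^\s()<>]+\z/

p '#com' =~ loose           # => 0   (# is accepted)
p 'm!crosoft.com' =~ loose  # => 0   (! is accepted)
p 'a b' =~ loose            # => nil (whitespace is rejected)
```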
I'm not sure, but that doesn't look like a good regex for matching URLs.
Try the following, which matches a URL according to RFC 3986.
Both absolute and relative URLs are supported.
Set case insensitivity to true.
^
(# Scheme
[a-z][a-z0-9+\-.]*:
(# Authority & path
//
([a-z0-9\-._~%!$&'()*+,;=]+#)? # User
([a-z0-9\-._~%]+ # Named host
|\[[a-f0-9:.]+\] # IPv6 host
|\[v[a-f0-9][a-z0-9\-._~%!$&'()*+,;=:]+\]) # IPvFuture host
(:[0-9]+)? # Port
(/[a-z0-9\-._~%!$&'()*+,;=:#]+)*/? # Path
|# Path without authority
(/?[a-z0-9\-._~%!$&'()*+,;=:#]+(/[a-z0-9\-._~%!$&'()*+,;=:#]+)*/?)?
)
|# Relative URL (no scheme or authority)
([a-z0-9\-._~%!$&'()*+,;=#]+(/[a-z0-9\-._~%!$&'()*+,;=:#]+)*/? # Relative path
|(/[a-z0-9\-._~%!$&'()*+,;=:#]+)+/?) # Absolute path
)
# Query
(\?[a-z0-9\-._~%!$&'()*+,;=:#/?]*)?
# Fragment
(\#[a-z0-9\-._~%!$&'()*+,;=:#/?]*)?
$
Update 1
This does not match m!crosoft.com and #pple.com. It's probably due to something with Rubular.