I've modified a regex that I found here so that it would accept various UK and second-level TLDs.
/\b((?:^https?:\/\/|^[a-z0-9.\-]+[.][a-z]{2,4})(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!#()\[\]{};:'".,<>?]))/i
However, as you can see in my test data here, the regex matches URLs such as www.zapple.#com and https://m!crosoft.com, which are not valid.
For some reason # symbols are excluded before the .com, but after the . they are not.
Exclamation marks are not excluded at all, which is confusing since, as far as I can see, only letters, numbers and dashes are allowed before the period.
The # is matched by
[^\s()<>]+
And the ! mark by
(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+
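A quick way to confirm this is to test the negated character class on its own. The Ruby sketch below uses only the sub-pattern quoted above; the constant name and the sample strings are illustrative assumptions, not part of the original regex.

```ruby
# The negated class excludes only whitespace, parentheses and angle brackets,
# so '#' and '!' are both accepted by it.
TAIL = /\A[^\s()<>]+\z/

samples = ['#com', 'm!crosoft.com', 'has space', 'a(b)']
results = samples.map { |s| [s, TAIL.match?(s)] }.to_h

results.each { |s, ok| puts "#{s.inspect} => #{ok}" }
```

Anything that must be rejected after the leading host part has to be listed inside that negated class, or the host part has to be matched by a stricter positive class instead.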
I don't know, but that doesn't look like a good regex for matching URLs.
Try the following, which matches a URL according to RFC 3986.
Both absolute and relative URLs are supported.
Enable case-insensitive matching (and free-spacing mode, since the pattern contains comments and line breaks).
^
(# Scheme
[a-z][a-z0-9+\-.]*:
(# Authority & path
//
([a-z0-9\-._~%!$&'()*+,;=]+@)? # User
([a-z0-9\-._~%]+ # Named host
|\[[a-f0-9:.]+\] # IPv6 host
|\[v[a-f0-9][a-z0-9\-._~%!$&'()*+,;=:]+\]) # IPvFuture host
(:[0-9]+)? # Port
(/[a-z0-9\-._~%!$&'()*+,;=:@]+)*/? # Path
|# Path without authority
(/?[a-z0-9\-._~%!$&'()*+,;=:@]+(/[a-z0-9\-._~%!$&'()*+,;=:@]+)*/?)?
)
|# Relative URL (no scheme or authority)
([a-z0-9\-._~%!$&'()*+,;=@]+(/[a-z0-9\-._~%!$&'()*+,;=:@]+)*/? # Relative path
|(/[a-z0-9\-._~%!$&'()*+,;=:@]+)+/?) # Absolute path
)
# Query
(\?[a-z0-9\-._~%!$&'()*+,;=:@/?]*)?
# Fragment
(\#[a-z0-9\-._~%!$&'()*+,;=:@/?]*)?
$
Update 1
This does not match m!crosoft.com and #pple.com. It's probably due to something with Rubular.
Related
I'm trying to clean up some auto-generated code where input URL fragments:
may include spaces, which need to be %-escaped (as %20, not +)
may include other URL-invalid characters, which also need to be %-escaped
may include path separators, which need to be left alone (/)
may include already-escaped components, which need not to be doubly-escaped
The existing code uses libcurl (via Typhoeus and Ethon), which like command-line curl seems to happily accept spaces in URLs.
The existing code is all string-based and has a number of shenanigans involving removing extra slashes, adding missing slashes, etc. I'm trying to replace this with URI.join(), but this fails with bad URI(is not URI?) on the fragments with spaces.
The obvious solution is to use the (deprecated) URI.escape, which escapes spaces, but leaves slashes alone:
URI.escape('http://example.org/ spaces /<"punc^tu`ation">/non-ascïï 𝖈𝖍𝖆𝖗𝖘/&c.')
# => "http://example.org/%20spaces%20/%3C%22punc%5Etu%60ation%22%3E/non-asc%C3%AF%C3%AF%20%F0%9D%96%88%F0%9D%96%8D%F0%9D%96%86%F0%9D%96%97%F0%9D%96%98/%EF%BC%86%EF%BD%83%EF%BC%8E"
This mostly works, except for case (4) above: previously escaped components get double-escaped.
s1 = URI.escape(s)
# => "http://example.org/%20spaces%20/%3C%22punc%5Etu%60ation%22%3E/non-asc%C3%AF%C3%AF%20%F0%9D%96%88%F0%9D%96%8D%F0%9D%96%86%F0%9D%96%97%F0%9D%96%98/%EF%BC%86%EF%BD%83%EF%BC%8E"
URI.escape(s1)
# => "http://example.org/%2520spaces%2520/%253C%2522punc%255Etu%2560ation%2522%253E/non-asc%25C3%25AF%25C3%25AF%2520%25F0%259D%2596%2588%25F0%259D%2596%258D%25F0%259D%2596%2586%25F0%259D%2596%2597%25F0%259D%2596%2598/%25EF%25BC%2586%25EF%25BD%2583%25EF%25BC%258E"
The recommended alternatives to URI.escape, e.g. CGI.escape and ERB::Util.url_encode, are not suitable as they mangle the slashes (among other problems):
CGI.escape(s)
# => "http%3A%2F%2Fexample.org%2F+spaces+%2F%3C%22punc%5Etu%60ation%22%3E%2Fnon-asc%C3%AF%C3%AF+%F0%9D%96%88%F0%9D%96%8D%F0%9D%96%86%F0%9D%96%97%F0%9D%96%98%2F%EF%BC%86%EF%BD%83%EF%BC%8E"
ERB::Util.url_encode(s)
# => "http%3A%2F%2Fexample.org%2F%20spaces%20%2F%3C%22punc%5Etu%60ation%22%3E%2Fnon-asc%C3%AF%C3%AF%20%F0%9D%96%88%F0%9D%96%8D%F0%9D%96%86%F0%9D%96%97%F0%9D%96%98%2F%EF%BC%86%EF%BD%83%EF%BC%8E"
Is there a clean, out-of-the-box way to preserve existing slashes, escapes, etc. and escape only invalid characters in a URI string?
So far the best I've been able to come up with is something like:
include URI::RFC2396_Parser::PATTERN
INVALID = Regexp.new("[^%#{RESERVED}#{UNRESERVED}]")
def escape_invalid(str)
parser = URI::RFC2396_Parser.new
parser.escape(str, INVALID)
end
This seems to work:
s2 = escape_invalid(s)
# => "http://example.org/%20spaces%20/%3C%22punc%5Etu%60ation%22%3E/non-asc%C3%AF%C3%AF%20%F0%9D%96%88%F0%9D%96%8D%F0%9D%96%86%F0%9D%96%97%F0%9D%96%98/%EF%BC%86%EF%BD%83%EF%BC%8E"
s2 == escape_invalid(s2)
# => true
but I'm not confident in the regex concatenation (even if it is the way URI::RFC2396_Parser works internally) and I know it doesn't handle all cases (e.g., a % that isn't part of a valid hex escape should probably be escaped). I'd much rather find a library standard solution.
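One way to cover the stray-% case is to treat a % as invalid unless two hex digits follow it. The sketch below is not a library standard solution: it is a hand-rolled variant of escape_invalid, and the constant name NEEDS_ESCAPE and the allowed-character list (RFC 3986 reserved plus unreserved) are assumptions made for illustration.

```ruby
# Match either a '%' that does not start a valid %XX escape, or any character
# outside the RFC 3986 reserved + unreserved sets (assumed allow-list).
NEEDS_ESCAPE = /%(?![0-9A-Fa-f]{2})|[^A-Za-z0-9\-._~:\/?#\[\]@!$&'()*+,;=%]/

def escape_invalid(str)
  # Percent-encode each offending character byte by byte (UTF-8).
  str.gsub(NEEDS_ESCAPE) do |ch|
    ch.bytes.map { |b| format('%%%02X', b) }.join
  end
end

once  = escape_invalid('http://example.org/ spaces /a%20b/100%')
twice = escape_invalid(once)
puts once
puts once == twice
```

Because valid %XX sequences are skipped and a lone % becomes %25, applying the function a second time should leave the string unchanged. Note that reserved characters such as ?, # and [ are left alone, just as in the INVALID pattern above.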
I am using Pylint to go through a bunch of .py files and do the following tests:
bad-indentation
mixed-indentation
unused-variable
I also want to keep my pylintrc file as minimal as possible. This is what I had in the beginning:
[MASTER]
# Use multiple processes to speed up Pylint.
jobs=1
# Pickle collected data for later comparisons.
persistent=yes
# Allow loading of arbitrary C extensions. Extensions are imported into the
# active Python interpreter and may run arbitrary code.
unsafe-load-any-extension=no
[MESSAGES CONTROL]
# Disable all to choose the Tests one by one
disable=all
# Tests
enable=bad-indentation, # Used when an unexpected number of indentation’s tabulations or spaces has been found.
       mixed-indentation, # Used when there are some mixed tabs and spaces in a module.
       unused-variable # Used when a variable is defined but not used. (Use _var to ignore var).
[REPORTS]
# Set the output format. Available formats are text, parseable, colorized, json
# and msvs (visual studio).You can also give a reporter class, eg
# mypackage.mymodule.MyReporterClass.
output-format=text
# Tells whether to display a full report or only the messages
reports=no
# Activate the evaluation score.
score=no
[REFACTORING]
# Maximum number of nested blocks for function / method body
max-nested-blocks=5
[TYPECHECK]
# List of decorators that produce context managers, such as
# contextlib.contextmanager. Add to this list to register other decorators that
# produce valid context managers.
contextmanager-decorators=contextlib.contextmanager
# Tells whether missing members accessed in mixin class should be ignored. A
# mixin class is detected if its name ends with "mixin" (case insensitive).
ignore-mixin-members=yes
# This flag controls whether pylint should warn about no-member and similar
# checks whenever an opaque object is returned when inferring. The inference
# can return multiple potential results while evaluating a Python object, but
# some branches might not be evaluated, which results in partial inference. In
# that case, it might be useful to still emit no-member and other checks for
# the rest of the inferred objects.
ignore-on-opaque-inference=yes
# List of class names for which member attributes should not be checked (useful
# for classes with dynamically set attributes). This supports the use of
# qualified names.
ignored-classes=optparse.Values,thread._local,_thread._local
# Show a hint with possible names when a member name was not found. The aspect
# of finding the hint is based on edit distance.
missing-member-hint=yes
# The minimum edit distance a name should have in order to be considered a
# similar match for a missing member name.
missing-member-hint-distance=1
# The total number of similar names that should be taken in consideration when
# showing a hint for a missing member.
missing-member-max-choices=1
[MISCELLANEOUS]
# List of note tags to take in consideration, separated by a comma.
notes=FIXME,XXX,TODO
[SIMILARITIES]
# Ignore comments when computing similarities.
ignore-comments=yes
# Ignore docstrings when computing similarities.
ignore-docstrings=yes
# Ignore imports when computing similarities.
ignore-imports=no
# Minimum lines number of a similarity.
min-similarity-lines=4
[LOGGING]
# Logging modules to check that the string format arguments are in logging
# function parameter format
logging-modules=logging
[BASIC]
# Naming hint for argument names
argument-name-hint=(([a-z][a-z0-9_]{2,30})|(_[a-z0-9_]*))$
# Regular expression matching correct argument names
argument-rgx=(([a-z][a-z0-9_]{2,30})|(_[a-z0-9_]*))$
# Naming hint for attribute names
attr-name-hint=(([a-z][a-z0-9_]{2,30})|(_[a-z0-9_]*))$
# Regular expression matching correct attribute names
attr-rgx=(([a-z][a-z0-9_]{2,30})|(_[a-z0-9_]*))$
# Bad variable names which should always be refused, separated by a comma
bad-names=foo,bar,baz,toto,tutu,tata
# Naming hint for class attribute names
class-attribute-name-hint=([A-Za-z_][A-Za-z0-9_]{2,30}|(__.*__))$
# Regular expression matching correct class attribute names
class-attribute-rgx=([A-Za-z_][A-Za-z0-9_]{2,30}|(__.*__))$
# Naming hint for class names
class-name-hint=[A-Z_][a-zA-Z0-9]+$
# Regular expression matching correct class names
class-rgx=[A-Z_][a-zA-Z0-9]+$
# Naming hint for constant names
const-name-hint=(([A-Z_][A-Z0-9_]*)|(__.*__))$
# Regular expression matching correct constant names
const-rgx=(([A-Z_][A-Z0-9_]*)|(__.*__))$
# Minimum line length for functions/classes that require docstrings, shorter
# ones are exempt.
docstring-min-length=-1
# Naming hint for function names
function-name-hint=(([a-z][a-z0-9_]{2,30})|(_[a-z0-9_]*))$
# Regular expression matching correct function names
function-rgx=(([a-z][a-z0-9_]{2,30})|(_[a-z0-9_]*))$
# Good variable names which should always be accepted, separated by a comma
good-names=i,j,k,ex,Run,_
# Include a hint for the correct naming format with invalid-name
include-naming-hint=no
# Naming hint for inline iteration names
inlinevar-name-hint=[A-Za-z_][A-Za-z0-9_]*$
# Regular expression matching correct inline iteration names
inlinevar-rgx=[A-Za-z_][A-Za-z0-9_]*$
# Naming hint for method names
method-name-hint=(([a-z][a-z0-9_]{2,30})|(_[a-z0-9_]*))$
# Regular expression matching correct method names
method-rgx=(([a-z][a-z0-9_]{2,30})|(_[a-z0-9_]*))$
# Naming hint for module names
module-name-hint=(([a-z_][a-z0-9_]*)|([A-Z][a-zA-Z0-9]+))$
# Regular expression matching correct module names
module-rgx=(([a-z_][a-z0-9_]*)|([A-Z][a-zA-Z0-9]+))$
# Regular expression which should only match function or class names that do
# not require a docstring.
no-docstring-rgx=^_
# List of decorators that produce properties, such as abc.abstractproperty. Add
# to this list to register other decorators that produce valid properties.
property-classes=abc.abstractproperty
# Naming hint for variable names
variable-name-hint=(([a-z][a-z0-9_]{2,30})|(_[a-z0-9_]*))$
# Regular expression matching correct variable names
variable-rgx=(([a-z][a-z0-9_]{2,30})|(_[a-z0-9_]*))$
[VARIABLES]
# List of additional names supposed to be defined in builtins. Remember that
# you should avoid to define new builtins when possible.
#additional-builtins=
# Tells whether unused global variables should be treated as a violation.
allow-global-unused-variables=yes
# List of strings which can identify a callback function by name. A callback
# name must start or end with one of those strings.
callbacks=cb_,_cb
# A regular expression matching the name of dummy variables (i.e. expectedly
# not used).
dummy-variables-rgx=_+$|(_[a-zA-Z0-9_]*[a-zA-Z0-9]+?$)|dummy|^ignored_|^unused_
# Argument names that match this expression will be ignored. Default to name
# with leading underscore
ignored-argument-names=_.*|^ignored_|^unused_
# Tells whether we should check for unused import in __init__ files.
init-import=no
# List of qualified module names which can have objects that can redefine
# builtins.
redefining-builtins-modules=six.moves,future.builtins
[SPELLING]
# Tells whether to store unknown words to indicated private dictionary in
# --spelling-private-dict-file option instead of raising a message.
spelling-store-unknown-words=no
[FORMAT]
# Regexp for a line that is allowed to be longer than the limit.
ignore-long-lines=^\s*(# )?<?https?://\S+>?$
# Number of spaces of indent required inside a hanging or continued line.
indent-after-paren=4
# String used as indentation unit. This is usually "    " (4 spaces) or "\t" (1
# tab).
indent-string='    '
# Maximum number of characters on a single line.
max-line-length=125
# Maximum number of lines in a module
max-module-lines=1000
# List of optional constructs for which whitespace checking is disabled. `dict-
# separator` is used to allow tabulation in dicts, etc.: {1 : 1,\n222: 2}.
# `trailing-comma` allows a space between comma and closing bracket: (a, ).
# `empty-line` allows space-only lines.
no-space-check=trailing-comma,dict-separator
# Allow the body of a class to be on the same line as the declaration if body
# contains single statement.
single-line-class-stmt=no
# Allow the body of an if to be on the same line as the test if there is no
# else.
single-line-if-stmt=no
[CLASSES]
# List of method names used to declare (i.e. assign) instance attributes.
defining-attr-methods=__init__,__new__,setUp
# List of member names, which should be excluded from the protected access
# warning.
exclude-protected=_asdict,_fields,_replace,_source,_make
# List of valid names for the first argument in a class method.
valid-classmethod-first-arg=cls
# List of valid names for the first argument in a metaclass class method.
valid-metaclass-classmethod-first-arg=mcs
[IMPORTS]
# Allow wildcard imports from modules that define __all__.
allow-wildcard-with-all=no
# Analyse import fallback blocks. This can be used to support both Python 2 and
# 3 compatible code, which means that the block might have code that exists
# only in one or another interpreter, leading to false positives when analysed.
analyse-fallback-blocks=no
# Deprecated modules which should not be used, separated by a comma
deprecated-modules=regsub,TERMIOS,Bastion,rexec
# Force import order to recognize a module as part of a third party library.
known-third-party=enchant
[DESIGN]
# Maximum number of arguments for function / method
max-args=5
# Maximum number of attributes for a class (see R0902).
max-attributes=7
# Maximum number of boolean expressions in a if statement
max-bool-expr=5
# Maximum number of branch for function / method body
max-branches=12
# Maximum number of locals for function / method body
max-locals=15
# Maximum number of parents for a class (see R0901).
max-parents=7
# Maximum number of public methods for a class (see R0904).
max-public-methods=20
# Maximum number of return / yield for function / method body
max-returns=6
# Maximum number of statements in function / method body
max-statements=50
# Minimum number of public methods for a class (see R0903).
min-public-methods=2
[EXCEPTIONS]
# Exceptions that will emit a warning when being caught.
overgeneral-exceptions=Exception
Can I just simplify it to this:
[MASTER]
# Use multiple processes to speed up Pylint.
jobs=1
[MESSAGES CONTROL]
# Disable all to choose the Tests one by one
disable=all
# Tests
enable=bad-indentation, # Used when an unexpected number of indentation’s tabulations or spaces has been found.
       mixed-indentation, # Used when there are some mixed tabs and spaces in a module.
       unnecessary-semicolon, # Used when a statement is ended by a semi-colon (”;”), which isn’t necessary.
       unused-variable # Used when a variable is defined but not used. (Use _var to ignore var).
[REPORTS]
# Tells whether to display a full report or only the messages
reports=no
# Activate the evaluation score.
score=no
[FORMAT]
# Regexp for a line that is allowed to be longer than the limit.
ignore-long-lines=^\s*(# )?<?https?://\S+>?$
# Number of spaces of indent required inside a hanging or continued line.
indent-after-paren=4
# String used as indentation unit. This is usually "    " (4 spaces) or "\t" (1
# tab).
indent-string='    '
# Maximum number of lines in a module
max-module-lines=1000
[EXCEPTIONS]
# Exceptions that will emit a warning when being caught.
overgeneral-exceptions=Exception
I believe that if I'm only running those tests, many of the other lines are useless. What happens if I remove one of the lines? Does the option fall back to its default value?
For example, if I set reports=no I don't need the line output-format=text, right? And if I remove the jobs=1 line, will it still use the default?
Yes, any option you don't define in the pylint configuration falls back to its default value.
When I use the following code
var pageMod = require("sdk/page-mod");
pageMod.PageMod({
include: "http://www.page.com/user/*",
contentScript: 'window.alert("user");'
});
I get the alert. But I want to replace the "http://www." part, so I tried:
*://*.page.com/user/*
*://page.com/user/*
*.page.com/user/*
and none of those work for me. Examples from developer.mozilla.org indicate that at least one of them should work. What is wrong with those?
I have encountered this problem in the past: you cannot use more than one * (wildcard) in the pattern.
You have two options:
Use an array of websites, i.e. ["http://www.page.com/user/*", "https://www.page.com/user/*"]
Use a RegEx (Regular Expression)
Here is how you can use a RegEx to get what you wanted when you tried *://*.page.com/user/*
Use the following RegEx: .+:\/\/(.+\.)?page\.com\/user\/.*
Here is how it works (if you do not know RegEx, I would suggest learning it):
.+ # Any character 1+ times - Selects the Protocol (http, https, ftp)
:\/\/ # :// After Protocol (/ have to be escaped using \/)
(.+\.)? # (Optional) Letters followed by a . (dot) - (www.)
page # Website Name - (page)
\.com # .com - (Top-Level Domain)
\/user\/ # Folder /user/ (/ have to be escaped using \/)
.* # Any character 0 or more times - (Any folders / files after the /user/ folder)
Here is a good site to learn RegEx if you do not already know them: RegexOne
So, your full include will be:
include: /.+:\/\/(.+\.)?page\.com\/user\/.*/,
Note that in JavaScript you define a RegEx by enclosing it in /s.
Here is a Live Demo of the RegEx working
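The pattern can also be sanity-checked outside the add-on. The sketch below does so in Ruby rather than JavaScript; the sample URLs are made up, and anchoring with \A and \z is an assumption that the whole URL has to match.

```ruby
# Same pattern as the include above, anchored so the whole URL must match.
INCLUDE = %r{\A.+://(.+\.)?page\.com/user/.*\z}

urls = [
  'http://www.page.com/user/1',
  'https://page.com/user/',
  'http://www.page.com/admin/',
  'http://evil.example/?next=http://page.com/user/'
]

matches = urls.select { |u| INCLUDE.match?(u) }
puts matches
```

The last sample is worth noticing: because the leading .+ accepts any characters, a URL on another host that merely contains ://page.com/user/ later on also matches. Restricting the scheme (for example https?) and the subdomain part to host characters would tighten this.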
Regex newbie here. I have a regular expression that matches Windows pathnames and UNC paths, terminated by '\'.
Working examples:
c:\windows\
c:\
\\server\share\
\\server\sh are\
Invalid:
c:\windows
\\server
\\server\share
\\server\ share \
It works as expected (at least I hope so), but it's pretty unreadable and not very performant, so any tips for optimization are greatly appreciated...
/\A(
([a-z]:\\(([a-zA-Z0-9äöüÄÖÜß_.$]+|[a-zA-Z0-9äöüÄÖÜß_.$]+[a-zA-Z0-9äöüÄÖÜß_.$\ ]*[a-zA-Z0-9äöüÄÖÜß_.$]+)\\)*)|
(\\\\(([a-zA-Z0-9äöüÄÖÜß_.$]+|[a-zA-Z0-9äöüÄÖÜß_.$]+[a-zA-Z0-9äöüÄÖÜß_.$\ ]*[a-zA-Z0-9äöüÄÖÜß_.$]+)\\)+(([a-zA-Z0-9äöüÄÖÜß_.$]+|
[a-zA-Z0-9äöüÄÖÜß_.$]+[a-zA-Z0-9äöüÄÖÜß_.$\ ]*[a-zA-Z0-9äöüÄÖÜß_.$]+)\\)+)
)\z/
In Ruby 1.9, the following should work:
if subject =~
/\A(?:(?!.*\\(?:con|prn|aux|nul|com\d|lpt\d)\\) # exclude invalid names
(?: # Either match
[a-z]:\\ # drive letter
| # or
\\\\(?:[^\\\/:*?"<>|\s]+\\){2} # UNC share name
) # End of alternation
(?: # Try to match:
(?!\s) # (Assert no starting space)
[^\\\/:*?"<>|\r\n]+ # a valid directory name
(?<!\s) # (Assert no ending space)
\\ # backslash
)* # repeat as needed
)\Z/mix
# Successful match
else
# Match attempt failed
end
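To see how this pattern behaves on the sample paths from the question, it can be run over them in a small script. The sketch below repeats the pattern from this answer; the constant name PATH_RE and the results hash are illustrative assumptions.

```ruby
PATH_RE = /\A(?:(?!.*\\(?:con|prn|aux|nul|com\d|lpt\d)\\)  # exclude reserved names
  (?:
    [a-z]:\\                            # drive letter
  |
    \\\\(?:[^\\\/:*?"<>|\s]+\\){2}      # UNC server and share
  )
  (?:
    (?!\s)                              # no leading space
    [^\\\/:*?"<>|\r\n]+                 # directory name
    (?<!\s)                             # no trailing space
    \\
  )*
)\Z/mix

# Single-quoted strings: "\\" is one backslash, "\w" or "\s" stay as written.
samples = [
  'c:\windows\\', 'c:\\', '\\\\server\share\\', '\\\\server\sh are\\',
  'c:\windows', '\\\\server', '\\\\server\share', '\\\\server\ share \\'
]

results = samples.map { |p| [p, !PATH_RE.match(p).nil?] }.to_h
results.each { |path, ok| puts "#{path} => #{ok}" }
```

One difference from the question's list: \\server\sh are\ is rejected here, because the UNC server and share names are matched by a class that excludes all whitespace. If spaces inside share names must be accepted, that class would need the same leading and trailing space assertions as the directory part.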
I want to match urls in text and replace them with anchor tags, but I want to exclude some terminators just like how Twitter matches urls in tweets.
So far I've got this, but it's obviously not working too well.
(http[s]?\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?)
EDIT: Some example urls. In all cases below I only want to match "http://www.example.com"
http://www.example.com.
http://www.example.com:
"http://www.example.com"
http://www.example.com;
http://www.example.com!
[http://www.example.com]
{http://www.example.com}
http://www.example.com*
I looked into this very issue last year and developed a solution that you may want to look at. See: URL Linkification (HTTP/FTP). This link is a test page for the JavaScript solution with many examples of difficult-to-linkify URLs.
My regex solution, written for both PHP and JavaScript (but easily translated to Ruby), is not simple (but neither is the problem, as it turns out). For more information I would recommend also reading:
The Problem With URLs by Jeff Atwood, and
An Improved Liberal, Accurate Regex Pattern for Matching URLs by John Gruber
The comments following Jeff's blog post are a must read if you want to do this right...
Ruby's URI module has an extract method that is used to parse out URLs from text. Parsing the returned values lets you piggyback on the heuristics in the module to extract the scheme and host information from a URL, avoiding reinventing the wheel.
text = '
http://www.example.com.
http://www.example.com:
"http://www.example.com"
http://www.example.com;
http://www.example.com!
[http://www.example.com]
{http://www.example.com}
http://www.example.com*
http://www.example.com/foo/bar?q=foobar
http://www.example.com:81
'
require 'uri'
puts URI::extract(text).map{ |u| uri = URI.parse(u); "#{ uri.scheme }://#{ uri.host[/(^.+?)\.?$/, 1] }" }
# >> http://www.example.com
# >> http://www.example.com
# >> http://www.example.com
# >> http://www.example.com
# >> http://www.example.com
# >> http://www.example.com
# >> http://www.example.com
# >> http://www.example.com
# >> http://www.example.com
# >> http://www.example.com
The only gotcha is that a period '.' is a legitimate character in a host name, so URI#host won't strip it. Those get caught in the map statement where the URL is rebuilt. Note that URI is stripping off the path and query information.
A pragmatic and easily understandable solution is:
regex = %r!"(https?://[-.\w]+\.\w{2,6})"!
Some notes:
With %r we can choose the start and end delimiter. In this case I used exclamation mark, since I want to use slash unescaped in the regex.
The optional quantifier (i.e. '?') binds only to the preceding expression, in this case 's'. There's no need to put the 's' in a character class [s]?. It's the same as s?.
Inside the character class [-.\w] we don't need to escape dash and dot in order to make them match dot and dash literally. Dash should be first, however, to not mean range.
\w matches [A-Za-z0-9_] in Ruby. It's not exactly the full definition of URL characters, but combined with dash and dot it may be enough for our needs.
Top-level domains are assumed to be between 2 and 6 characters long, e.g. '.se' and '.travel'
I'm not sure what you mean by I want to exclude some terminators but this regex matches only the wanted one in your example.
We want to use the first capture group, e.g. like this:
if input =~ %r!"(https?://[-.\w]+\.\w{2,6})"!
match = $~[1]
else
match = ""
end
What about this?
%r|https?://[-\w.]*\w|
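As a rough check, this last pattern can be run over the example lines from the question. The sketch below assumes the eight sample lines are available as one string; the names HOST_URL, text and found are illustrative.

```ruby
# Scheme, then host characters; the final \w forces the match to end on a
# word character, so trailing punctuation such as '.' is left out.
HOST_URL = %r|https?://[-\w.]*\w|

text = <<~TEXT
  http://www.example.com.
  http://www.example.com:
  "http://www.example.com"
  http://www.example.com;
  http://www.example.com!
  [http://www.example.com]
  {http://www.example.com}
  http://www.example.com*
TEXT

found = text.scan(HOST_URL)
puts found.uniq
```

Since the character class contains neither / nor :, the match stops at the host, so any path, query string or port is dropped. That fits the examples in the question but would need extending if full URLs should be linked.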