Using regex, how could I remove everything before the first path / in a URL?
Example URL: https://www.example.com/some/page?user=1&email=joe#schmoe.org
From that, I just want /some/page?user=1&email=joe#schmoe.org
If it's just the root domain (i.e. https://www.example.com/), then I just want the / to be returned.
The domain may or may not have a subdomain, and it may or may not use a secure protocol. Ultimately I just want to strip out anything before that first path slash.
In the event that it matters, I'm running Ruby 1.9.3.
Don't use regex for this. Use the URI class. You can write:
require 'uri'
u = URI.parse('https://www.example.com/some/page?user=1&email=joe#schmoe.org')
u.path #=> "/some/page"
u.query #=> "user=1&email=joe"
u.fragment #=> "schmoe.org"
# All together - request_uri returns just the path when there is no query (no ?); it never includes the fragment
u.request_uri #=> "/some/page?user=1&email=joe"
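Note that URI treats everything after the # as the fragment, which it keeps separate from the path and the query. If the goal is everything from the first path slash onward, fragment included, the pieces can be reassembled. The following is only a minimal sketch; the helper name path_onward is made up for illustration and is not part of URI.

```ruby
require 'uri'

# Hypothetical helper: returns the path, query and fragment of a URL,
# falling back to "/" when the URL has no path at all.
def path_onward(url)
  uri = URI.parse(url)
  result = uri.path.empty? ? '/' : uri.path.dup
  result += "?#{uri.query}" if uri.query
  result += "##{uri.fragment}" if uri.fragment
  result
end

puts path_onward('https://www.example.com/some/page?user=1&email=joe#schmoe.org')
puts path_onward('https://www.example.com/')
```

The explicit empty-path check covers a URL such as http://test.com, which has no trailing slash.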
require 'uri'
uri = URI.parse("https://www.example.com/some/page?user=1&email=joe#schmoe.org")
> uri.path + '?' + uri.query
=> "/some/page?user=1&email=joe"
> uri.path + '?' + uri.query + '#' + uri.fragment
=> "/some/page?user=1&email=joe#schmoe.org"
As Gavin also mentioned, it's not a good idea to use a regexp for this, although it's tempting.
You could have URLs with special characters, even Unicode characters, that you did not expect when you wrote the regexp. This can particularly happen in your query string. Using the URI library is the safer approach.
The same can be done using String#index
index(substring[, offset])
str = "https://www.example.com/some/page?user=1&email=joe#schmoe.org"
offset = str.index("//") # => 6
str[str.index('/',offset + 2)..-1]
# => "/some/page?user=1&email=joe#schmoe.org"
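String#index returns nil when it finds nothing, so the one-liner above needs a guard for input without "//" or for a URL such as http://test.com that has no path slash. A hedged sketch with those guards follows; strip_before_path is a hypothetical name, and returning "/" for a missing path is an assumption based on what the question asks for.

```ruby
# Hypothetical helper: everything from the first path slash onward,
# or "/" when the string has no path slash at all.
def strip_before_path(str)
  scheme_end = str.index('//')
  # Skip past "//" when present; otherwise search from the beginning.
  start = scheme_end ? scheme_end + 2 : 0
  slash = str.index('/', start)
  slash ? str[slash..-1] : '/'
end

puts strip_before_path('https://www.example.com/some/page?user=1&email=joe#schmoe.org')
puts strip_before_path('http://test.com')
```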
I strongly agree with the advice to use the URI module in this case, and I don't consider myself great with regular expressions. Still, it seems worthwhile to demonstrate one possible way to do what you ask.
test_url1 = 'https://www.example.com/some/page?user=1&email=joe#schmoe.org'
test_url2 = 'http://test.com/'
test_url3 = 'http://test.com'
regex = /^https?:\/\/[^\/]+(.*)/
regex.match(test_url1)[1]
# => "/some/page?user=1&email=joe#schmoe.org"
regex.match(test_url2)[1]
# => "/"
regex.match(test_url3)[1]
# => ""
Note that in the last case, the URL had no trailing '/' so the result is the empty string.
The regular expression (/^https?:\/\/[^\/]+(.*)/) says the string starts with (^) http (http), optionally followed by s (s?), followed by :// (:\/\/) followed by at least one non-slash character ([^\/]+), followed by zero or more characters, and we want to capture those characters ((.*)).
I hope that you find that example and explanation educational, and I again recommend against actually using a regular expression in this case. The URI module is simpler to use and far more robust.
How do I write a regex in ruby that will look for a "-" and ".org" or "com" like:
some-thing.org
some-thing.org.sg
some-thing.com
some-thing.com.sg
some-thing.com.* (there are too many countries so for now any suffix is fine- I will deal with this problem later )
but not:
some-thing
some-thing.moc
I wrote : /.-.(org)?|.*(.com)/i
but it fails to stop "some-thing" or "some-thing.moc" :(
Support optional hyphen
I can come up with this regex:
(https?:\/\/)?(www\.)?[a-z0-9-]+\.(com|org)(\.[a-z]{2,3})?
Keep in mind that I used capturing groups for simplicity, but if you want to avoid capturing the content you can use non capturing groups like this:
(?:https?:\/\/)?(?:www\.)?[a-z0-9-]+\.(?:com|org)(?:\.[a-z]{2,3})?
^--- Notice "?:" to use non capturing groups
Additionally, if you don't want to use protocol and www pattern you can use:
[a-z0-9-]+\.(?:com|org)(?:\.[a-z]{2,3})?
Support mandatory hyphen
However, as Greg Hewgill pointed out in his comment, if you want to ensure there is at least one hyphen, you can use this regex:
(?:https?:\/\/)?(?:www\.)?[a-z0-9]+(?:[-][a-z0-9]+)+\.(?:com|org)(?:\.[a-z]{2,3})?
Be aware, though, that this regex can run into severe backtracking issues.
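To see how the mandatory-hyphen pattern behaves on the names from the question, here is a small sketch. The \A and \z anchors, the /i flag and the method name hyphenated_domain? are additions for this illustration, not part of the answer above.

```ruby
# Hypothetical helper: true when the whole string is a hyphenated
# .com/.org name, optionally with a protocol, "www." and a country suffix.
def hyphenated_domain?(name)
  pattern = /\A(?:https?:\/\/)?(?:www\.)?[a-z0-9]+(?:-[a-z0-9]+)+\.(?:com|org)(?:\.[a-z]{2,3})?\z/i
  !!(name =~ pattern)
end

%w[some-thing.org some-thing.com.sg some-thing some-thing.moc something.com].each do |name|
  puts "#{name}: #{hyphenated_domain?(name)}"
end
```

Without the anchors the pattern can match a substring of a longer string, so whether to anchor depends on whether the input is a single name or free text.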
This may help :
/[a-z0-9]+-?[a-z0-9]+\.(org|com)(\.[a-z]+)?/i
It matches '-' in the middle optionally, i.e. still matches names without '-'.
I had a similar issue when I was writing an HTTP server...
... I ended up using the following Regexp:
m = url.match /(([a-z0-9A-Z]+):\/\/)?(([^\/\:]+))?(:([0-9]+))?([^\?\#]*)(\?([^\#]*))?/
m[2] # => requested protocol (optional), e.g. https, http, ws, ftp (m[1] also includes the "://")
m[4] # => host name (optional), e.g. www.my-site.com
m[6] # => port (optional)
m[7] # => encoded path, e.g. /index.htm
If what you are trying to do is validate a host name, you can simply make sure it doesn't contain the few illegal characters (:, /) and contains at least one dot separated string.
If you want to validate only .com or .org (+ country codes), you can do something like this:
def is_legit_url?(url)
  allowed_master_domains = %w{com org}
  allowed_country_domains = %w{sg it uk}
  # Group the country alternatives so the leading \. applies to each of them, not only the first
  url.match(/[^\/\:]+\.(#{allowed_master_domains.join '|'})(\.(?:#{allowed_country_domains.join '|'}))?/i) && true
end
* Note that certain countries use .co, e.g. the UK uses www.amazon.co.uk
I would convert the Regexp to a constant, for performance reasons:
module MyURLReview
  ALLOWED_MASTER_DOMAINS = %w{com org}
  ALLOWED_COUNTRY_DOMAINS = %w{sg it uk}
  # The country alternatives are grouped so the leading \. applies to each of them
  DOMAINS_REGEXP = /[^\/\:]+\.(#{ALLOWED_MASTER_DOMAINS.join '|'})(\.(?:#{ALLOWED_COUNTRY_DOMAINS.join '|'}))?/i
  def self.is_legit_url?(url)
    url.match(DOMAINS_REGEXP) && true
  end
end
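One thing to watch when interpolating a joined list into a pattern is that | binds very loosely, so each list needs its own group for the surrounding \. to apply to every alternative. Below is a hedged sketch of a stricter variant that builds the groups with Regexp.union and anchors the whole host name. The names alternation and legit_host? are hypothetical, and anchoring with \A and \z is an assumption about the input being a bare host name.

```ruby
# Hypothetical helper: turns a word list into a non-capturing alternation.
# Regexp.union escapes the words; .source is used so the outer /i flag still applies.
def alternation(words)
  "(?:#{Regexp.union(words).source})"
end

# Hypothetical helper: true when host ends in .com/.org plus an optional allowed country code.
def legit_host?(host)
  master  = alternation(%w[com org])
  country = alternation(%w[sg it uk])
  !!(host =~ /\A[^\/:]+\.#{master}(?:\.#{country})?\z/i)
end

puts legit_host?('some-thing.com.sg')
puts legit_host?('some-thing.com.xx')
```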
Good Luck!
/[a-zA-Z0-9]-[a-zA-Z0-9]+?\.(?:org|com)\.?/
Of course, the above could be simplified depending on how lenient your rules are. The following is a simpler pattern, but would allow s0me-th1ng.com-plete to pass through:
/\w-\w+?\.(?:org|com)\b/
You could use a lookahead:
^(?=[^.]+-[^.]+)([^.]+\.(?:org|com).*)
Assuming you are looking for the general pattern of letters-letters where letters could be Unicode, you can do:
^(?=\p{L}+-\p{L}+)([^.]+\.(?:org|com).*)
If you want to add digits:
^(?=[\p{L}0-9]+-[\p{L}0-9]+)([^.]+\.(?:org|com).*)
So that you can match sòme1-thing.com
(\p{L} needs Ruby 1.9 or later, I think...)
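As a quick check of the lookahead idea in Ruby, the sketch below wraps the letters-and-digits pattern in a method. The method name unicode_hyphenated? is made up for this example, and \A replaces ^ on the assumption that the input is a single name rather than multi-line text.

```ruby
# Hypothetical helper: the lookahead requires "letters-letters" (Unicode letters or digits)
# before the first dot; the rest requires a .org or .com suffix.
def unicode_hyphenated?(name)
  !!(name =~ /\A(?=[\p{L}0-9]+-[\p{L}0-9]+)([^.]+\.(?:org|com).*)/)
end

puts unicode_hyphenated?("s\u00F2me1-thing.com")  # "\u00F2" is the letter o with a grave accent
puts unicode_hyphenated?('something.com')
```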
So I have a very specific URL, that tends to always follow the following format:
http://mtc.cdn.vine.co/r/videos/0DCB6FF2EF1179983941847883776_38a153447e7.1.5.3901866229871838946.mp4?versionId=.k9_w6W7t1Yr1KUCWRIm6AnYhSdOUz32
Basically I want to grab everything after the last . and before the ?versionId, as I imagine that's the consistent location of the file extension.
I currently have something like \.\.{0}(.+)\?versionId, but it matches everything from the first . up to versionId.
One solution I thought about was using the . as a delimiter. I've never tried to exclude a character, but basically I want to match everything that starts after a . and reject any candidate that has another . before the ?.
Anyone got any idea how to get this to work?
Is your goal to get 'mp4'? Might consider not using a regex at all...
> require 'uri'
> uri = URI.parse('http://mtc.cdn.vine.co/r/videos/0DCB6FF2EF1179983941847883776_38a153447e7.1.5.3901866229871838946.mp4?versionId=.k9_w6W7t1Yr1KUCWRIm6AnYhSdOUz32')
=> #<URI::HTTP http://mtc.cdn.vine.co/r/videos/0DCB6FF2EF1179983941847883776_38a153447e7.1.5.3901866229871838946.mp4?versionId=.k9_w6W7t1Yr1KUCWRIm6AnYhSdOUz32>
> uri.path
=> "/r/videos/0DCB6FF2EF1179983941847883776_38a153447e7.1.5.3901866229871838946.mp4"
> File.extname(uri.path)
=> ".mp4"
Completely in agreement with Philip Hallstrom, this is a typical XY problem. However, if you really wish to just hone your Regexp skills, the literal solution to your question is (Rubular):
(?<=\.)[^.]+(?=\?)
"From where a period was just before, match any number of non-periods, matching up to where question mark is just after."
To understand this, read up on positive lookbehind ((?<=...)), positive lookahead ((?=...)), and negated character sets ([^...]).
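For comparison, here are both approaches side by side as a small sketch. The method names are hypothetical; the first uses URI and File.extname, the second the lookaround pattern above.

```ruby
require 'uri'

# Hypothetical helper: extension via the standard library, without the leading dot.
def extension_via_uri(url)
  File.extname(URI.parse(url).path).sub(/\A\./, '')
end

# Hypothetical helper: extension via the lookaround pattern; nil when nothing matches.
def extension_via_regex(url)
  url[/(?<=\.)[^.]+(?=\?)/]
end

url = 'http://mtc.cdn.vine.co/r/videos/0DCB6FF2EF1179983941847883776_38a153447e7.1.5.3901866229871838946.mp4?versionId=.k9_w6W7t1Yr1KUCWRIm6AnYhSdOUz32'
puts extension_via_uri(url)
puts extension_via_regex(url)
```

The regex version depends on a ? following the extension, so it returns nil for a URL without a query string, while the URI version still works.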
I have
http://foobar.s3.amazonaws.com/uploads/users/15/photos/12/foo.jpg
How do I return
uploads/users/15/photos/12/foo.jpg
It is better to use the URI parsing that is part of the Ruby standard library
than to experiment with some regular expression that may or may not take every
possible special case into account.
require 'uri'
url = "http://foo.s3.amazonaws.com/uploads/users/15/photos/12/foo.jpg"
path = URI.parse(url).path
# => "/uploads/users/15/photos/12/foo.jpg"
path[1..-1]
# => "uploads/users/15/photos/12/foo.jpg"
No need to reinvent the wheel.
"http://foobar.s3.amazonaws.com/uploads/users/15/photos/12/foo.jpg".sub("http://foobar.s3.amazonaws.com/","")
would be an explicit version, in which you substitute the homepage-part with an empty string.
For a more universal approach I would recommend a regular expression, similar to this one:
string = "http://foobar.s3.amazonaws.com/uploads/users/15/photos/12/foo.jpg"
string.sub(/(http:\/\/)*.*?\.\w{2,3}\//,"")
If it's needed, I could explain the regular expression.
link = "http://foobar.s3.amazonaws.com/uploads/users/15/photos/12/foo.jpg"
path = link.match /\/\/[^\/]*\/(.*)/
path[1]
#=> "uploads/users/15/photos/12/foo.jpg"
Someone recommended this approach as well:
URI.parse(URI.escape('http://foobar.s3.amazonaws.com/uploads/users/15/photos/12/foo.jpg')).path[1..-1]
Are there any disadvantages to using something like this versus a regexp approach?
The cheap answer is to just strip everything before the first single /.
Better answers are "How do I process a URL in ruby to extract the component parts (scheme, username, password, host, etc)?" and "Remove subdomain from string in ruby".
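The "first single /" idea could be sketched with lookarounds that skip the doubled slash after the scheme. This is only an illustration; the method name after_first_single_slash is hypothetical, and a URL with no path slash is returned unchanged.

```ruby
# Hypothetical helper: removes everything up to and including the first slash
# that is neither preceded nor followed by another slash.
def after_first_single_slash(url)
  url.sub(%r{\A.*?(?<!/)/(?!/)}, '')
end

puts after_first_single_slash('http://foobar.s3.amazonaws.com/uploads/users/15/photos/12/foo.jpg')
```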
I wrote a Ruby script to process a large number of documents and use the following regex to extract URIs from a document's string representation:
#Taken from: http://daringfireball.net/2010/07/improved_regex_for_matching_urls
URI_REGEX = /
( # Capture 1: entire matched URL
(?:
[a-z][\w-]+: # URL protocol and colon
(?:
\/{1,3} # 1-3 slashes
| # or
[a-z0-9%] # Single letter or digit or '%'
)
| # or
www\d{0,3}[.] # "www.", "www1.", "www2." … "www999."
| # or
[a-z0-9.\-]+[.][a-z]{2,4}\/ # looks like domain name followed by a slash
)
(?: # One or more:
[^\s()<>]+ # Run of non-space, non-()<>
| # or
\(([^\s()<>]+|(\([^\s()<>]+\)))*\) # balanced parens, up to 2 levels
)+
(?: # End with:
\(([^\s()<>]+|(\([^\s()<>]+\)))*\) # balanced parens, up to 2 levels
| # or
[^\s`!()\[\]{};:'".,<>?«»“”‘’] # not a space or one of these punct chars
)
)/xi
It works pretty well for 99.9 percent of all documents but always hangs my script when it encounters the following token in one of the documents: token = "synsem:local:cat:(subcat:SubMot,adjuncts:Adjs,subj:Subj),"
I am using the standard Ruby regexp operator, token =~ URI_REGEX, and I don't get any exception or error message.
First I tried to solve the problem by wrapping the regex evaluation in a Timeout::timeout block, but this degrades performance too much.
Any other ideas on how to solve this problem?
Your problem is catastrophic backtracking. I just loaded your regex and your test string into RegexBuddy, and it gave up after 1.000.000 iterations of the regex engine (and from the looks of it, it would have gone on for many millions more had it not aborted).
The problem arises because some parts of your text can be matched by different parts of your regex (which is horribly complicated and painful to read); it seems that the "One or more:" part of your regex and the "End with:" part struggle over the match (when it's not working), trying out millions of permutations that all fail.
It's difficult to suggest a solution without knowing what the rules for matching a URI are (which I don't). All this balancing of parentheses suggests to me that regexes may not be the right tool for the job. Maybe you could break down the problem. First use a simple regex to find everything that looks remotely like a URI, then validate that in a second step (isn't there a URI parser for Ruby of some sort?).
Another thing you might be able to do is to prevent the regex engine from backtracking by using atomic groups. If you can change some (?:...) groups into (?>...) groups, that would allow the regex to fail faster by disallowing backtracking into those groups. However, that might change the match and make it fail on occasions where backtracking is necessary to achieve a match at all - so that's not always an option.
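To make the trade-off with atomic groups concrete, here is a deliberately tiny sketch, unrelated to the URI regex itself. The patterns and method names are made up for illustration. Once the atomic group has consumed its characters it never gives any back, which makes failure fast but can also rule out a match that backtracking would have found.

```ruby
# Hypothetical helper: ordinary group, the engine may backtrack into \w+.
def backtracking_match?(str)
  !!(str =~ /\A(?:\w+)\d/)
end

# Hypothetical helper: atomic group, \w+ keeps everything it consumed.
def atomic_match?(str)
  !!(str =~ /\A(?>\w+)\d/)
end

puts backtracking_match?('abc1')  # \w+ takes "abc1", gives back "1", then \d matches it
puts atomic_match?('abc1')        # \w+ takes "abc1" and keeps it, so \d has nothing left
```

In the second pattern the digit is swallowed by \w+ for good, so it can never match; an atomic group is only safe where what follows cannot also be matched by the group's contents.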
Why reinvent the wheel?
require 'uri'
uri_list = URI.extract("Text containing URIs.")
URI.extract("Text containing URIs.") is the best solution if you only need the URIs.
I finally used pat = URI::Parser.new.make_regexp('http') to get the built-in URI parsing regexp and use it in match = str.match(pat, start_pos) to iteratively parse the input text URI by URI. I am doing this because I also need the URI positions in the text, and the returned match object gives me this information via match.begin(0).
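The iteration described above might look roughly like the following sketch. The method name uris_with_positions is hypothetical, and the simple pattern used in the example is only a stand-in for the one returned by make_regexp.

```ruby
# Hypothetical helper: collects [matched_text, start_offset] pairs by repeatedly
# calling String#match with a start position.
def uris_with_positions(text, pattern)
  results = []
  pos = 0
  while (m = text.match(pattern, pos))
    results << [m[0], m.begin(0)]
    pos = m.end(0)
    pos += 1 if m[0].empty?  # avoid looping forever on an empty match
  end
  results
end

# Stand-in pattern for illustration only; not the library's URI regexp.
simple_pattern = %r{https?://[^\s<>"]+}
p uris_with_positions('See http://a.example/x and https://b.example/y now', simple_pattern)
```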
What I'm trying to achieve here is this: let's say we have two example URLs:
url1 = "http://emy.dod.com/kaskaa/dkaiad/amaa//////////"
url2 = "http://www.example.com/"
How can I extract the stripped-down URLs?
url1 = "http://emy.dod.com/kaskaa/dkaiad/amaa"
url2 = "http://www.example.com"
URI.parse in Ruby sanitizes certain types of malformed URLs but is ineffective in this case.
If we use regex then /^(.*)\/$/ removes a single slash / from url1 and is ineffective for url2.
Is anybody aware of how to handle this type of URL parsing?
The point here is that I don't want my system to treat http://www.example.com/ and http://www.example.com as two different URLs. The same goes for http://emy.dod.com/kaskaa/dkaiad/amaa//// and http://emy.dod.com/kaskaa/dkaiad/amaa/.
If you just need to remove all slashes from the end of the url string then you can try the following regex:
"http://emy.dod.com/kaskaa/dkaiad/amaa//////////".sub(/(\/)+$/,'')
"http://www.example.com/".sub(/(\/)+$/,'')
/(\/)+$/ - this regex finds one or more slashes at the end of the string. Then we replace this match with empty string.
Hope this helps.
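Since the underlying goal is to treat such URLs as the same, the substitution can be wrapped in a small normalizing helper and applied before comparing. This is a sketch only; normalize_url is a hypothetical name, and \z is used instead of $ so that only the very end of the string is affected.

```ruby
# Hypothetical helper: drops every trailing slash so equivalent URLs compare equal.
def normalize_url(url)
  url.sub(%r{/+\z}, '')
end

puts normalize_url('http://emy.dod.com/kaskaa/dkaiad/amaa//////////')
puts normalize_url('http://www.example.com/') == normalize_url('http://www.example.com')
```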
Although this thread is a bit old and the top answer is quite good, I suggest another way to do this:
/^(.*?)\/*$/
You could see it in action here: https://regex101.com/r/vC6yX1/2
The magic here is *?, which does a lazy match. So the entire expression could be translated as:
Match and capture as few characters as possible, then match as many slashes as possible at the end.
In plainer English: the capture group holds the URL with all trailing slashes removed.
def without_trailing_slash(path)
  path[%r(.*[^/])]
end
path = "http://emy.dod.com/kaskaa/dkaiad/amaa//////////"
puts without_trailing_slash path # "http://emy.dod.com/kaskaa/dkaiad/amaa"