What regex can I use to get the domain name from a url in Ruby? - ruby

I am trying to construct a regex to extract a domain given a url.
for:
http://www.abc.google.com/
http://abc.google.com/
https://www.abc.google.com/
http://abc.google.com/
should give:
abc.google.com

require 'uri'
URI.parse('http://www.abc.google.com/').host
#=> "www.abc.google.com"
Not a regex, but probably more robust than anything we come up with here.
URI.parse('http://www.abc.google.com/').host.gsub(/^www\./, '')
#=> "abc.google.com"
If you want to remove the www. as well, this works without raising any errors when the www. is not there.
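As a minimal sketch, the two steps above can be wrapped in a small helper. The method name host_without_www is made up for illustration, and the sketch assumes every input is an absolute URL that includes a scheme.

```ruby
require 'uri'

# Hypothetical helper: host of an absolute URL without a leading "www.".
def host_without_www(url)
  host = URI.parse(url).host   # nil when the string has no scheme/host part
  host&.sub(/\Awww\./, '')     # strip "www." only at the very start
end

urls = %w[
  http://www.abc.google.com/
  http://abc.google.com/
  https://www.abc.google.com/
]
p urls.map { |u| host_without_www(u) }
```

A string without a scheme (for example abc.google.com) is parsed as a bare path, so the helper returns nil for it rather than raising.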

I don't know much about Ruby, but this regex pattern gives you the last three parts of the URL, excluding the trailing slash, with a minimum of 2 characters per part.
([\w-]{2,}\.[\w-]{2,}\.[\w-]{2,})/$
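For what it's worth, here is one way that pattern might be used from Ruby. The constant and method names are invented for this sketch; %r{...} is used only so the / does not need escaping.

```ruby
# The pattern from above, written with %r{} so "/" needs no backslash.
LAST_THREE_LABELS = %r{([\w-]{2,}\.[\w-]{2,}\.[\w-]{2,})/$}

# Hypothetical helper: returns the captured text, or nil when nothing matches.
def last_three_labels(url)
  match = url.match(LAST_THREE_LABELS)
  match && match[1]
end

p last_three_labels('http://www.abc.google.com/')
p last_three_labels('http://abc.google.com/')
```

Because the pattern is tied to a trailing slash at the end of the string, a URL with a path after the host, or a host with only two parts, yields nil.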

You may be able to use the domain_name gem for this kind of work. From the README:
require "domain_name"
host = DomainName("a.b.example.co.uk")
host.domain #=> "example.co.uk"

Your question is a little bit vague. Can you give a precise specification of exactly what you want to do? (Preferably with a test suite.) Right now, all your question says is that you want a method that always returns 'abc.google.com'. That's easy:
def extract_domain
  'abc.google.com'
end
But that's probably not what you meant …
Also, you say that you need a Regexp. Why? What's wrong with, for example, using the URI class? After all, parsing and manipulating URIs is exactly what it was made for!
require 'uri'
URI.parse('https://abc.google.com/').host # => 'abc.google.com'
And lastly, you say you are "trying to extract a domain", but you never specify what you mean by "domain". It looks like you sometimes mean the FQDN and sometimes randomly drop parts of the FQDN, but according to what rules? For example, for the FQDN abc.google.com, the domain name is google.com and the host name is abc, but you want it to return abc.google.com, which is not just the domain name but the full FQDN. Why?

Related

Remove specific parts from url

Let's suppose I have a URL like this:
https://www.youtube.com/watch/3e4345?v=rwmEkvPBG1s
What is the best and shortest way to get only the 3e4345 part?
Sometimes the URL has no additional params after the ?.
I don't want to use any gems.
What I did was:
url = url.split('/watch/')
url = url[1].split('/')[0].split('?')[0]
Is there a better way? Thanks
Possibly the safest and best option is to use URI:
require 'uri'
URI("https://www.youtube.com/watch/34345?v=rwmEkvPBG1s").path.split("/").last
For more, see "How to extract URL parameters from a URL with Ruby or Rails?"
You could use the match method to find a match based on a regular expression. The value at [1] is the first capture from the regular expression.
Note the parentheses around the \d+, which capture the digits from the URL when it matches.
url.to_s.match(/\/watch\/(\d+).*$/)[1]
require 'uri'
x = "https://www.youtube.com/watch/34345?v=rwmEkvPBG1s"
File.basename(URI(x).path)
# => "34345"
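A small sketch of the URI-based answers applied to the URL from the question, with and without the query string. The helper name last_path_segment is hypothetical.

```ruby
require 'uri'

# Hypothetical helper: last path segment of a URL, ignoring any query string.
def last_path_segment(url)
  # URI#path excludes the "?..." part, so the query never reaches File.basename.
  File.basename(URI.parse(url).path)
end

p last_path_segment('https://www.youtube.com/watch/3e4345?v=rwmEkvPBG1s')
p last_path_segment('https://www.youtube.com/watch/3e4345')
```

Unlike the \d+ regex, this does not care whether the segment contains letters, which matters for an ID such as 3e4345.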

ruby regex for removing url prefix and ending

I've been trying to figure this out, and I've searched, but I'm stuck.
Let's say I have the string www.google.com or http://google.com or just google.com
and I want to extract the string google out of those inputs.
A solution I can think of is first removing the first part (www.) and then removing the last section of the string (.com), but I suspect there is a simpler, more efficient way.
Any help would be greatly appreciated!
First, start with a tool designed to work with URLs. Ruby includes URI, and there's also Addressable::URI.
Using these you can strip down a URI into its defined components:
require 'uri'
uri = URI.parse('http://www.ruby-doc.org/stdlib-2.1.1/libdoc/uri/rdoc/URI.html')
uri.host # => "www.ruby-doc.org"
If your string doesn't start with a scheme, you can add one. (Schemes are important.)
url = 'foo.bar.com/some/path'
URI.parse('http://' + url).host
# => "foo.bar.com"
From that point you're going to have a tough time determining what is the true host versus the domain. A domain can be (pretty much) anything, and the host can be the domain name. Possibly you can get a list of domains, but remember that the list is constantly changing.
ICANN has a list of TLDs, as does IANA. Those are ONLY the top-level-domains, not the hosts that sit under them. However, using those lists you can strip the TLD from a host, and at least be a tiny bit closer to where you want to be.
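A rough sketch of that idea follows. Everything named here is an assumption made for illustration: the KNOWN_TLDS list is a tiny hand-written stand-in for a real IANA/ICANN list, and bare_name is a made-up helper.

```ruby
require 'uri'

# Hypothetical, deliberately tiny list; a real one would come from IANA/ICANN data.
KNOWN_TLDS = %w[com org net].freeze

# Hypothetical helper: "www.google.com", "http://google.com", "google.com" -> "google"
def bare_name(input)
  # Add a scheme when the string lacks one so URI can find the host.
  with_scheme = input.match?(%r{\A[a-z][a-z0-9+.-]*://}i) ? input : "http://#{input}"
  labels = URI.parse(with_scheme).host.sub(/\Awww\./, '').split('.')
  labels.pop if KNOWN_TLDS.include?(labels.last) # drop the TLD when it is recognised
  labels.last                                    # right-most remaining label
end

p %w[www.google.com http://google.com google.com].map { |s| bare_name(s) }
```

This only strips a single-label TLD, so a host under a multi-part suffix such as co.uk would need a more complete list and more careful matching.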

How to return file path without url link?

I have
http://foobar.s3.amazonaws.com/uploads/users/15/photos/12/foo.jpg
How do I return
uploads/users/15/photos/12/foo.jpg
It is better to use the URI parsing that is part of the Ruby standard library
than to experiment with some regular expression that may or may not take every
possible special case into account.
require 'uri'
url = "http://foo.s3.amazonaws.com/uploads/users/15/photos/12/foo.jpg"
path = URI.parse(url).path
# => "/uploads/users/15/photos/12/foo.jpg"
path[1..-1]
# => "uploads/users/15/photos/12/foo.jpg"
No need to reinvent the wheel.
"http://foobar.s3.amazonaws.com/uploads/users/15/photos/12/foo.jpg".sub("http://foobar.s3.amazonaws.com/","")
would be an explicit version, in which you substitute the homepage-part with an empty string.
For a more universal approach I would recommend a regular expression, similar to this one:
string = "http://foobar.s3.amazonaws.com/uploads/users/15/photos/12/foo.jpg"
string.sub(/(http:\/\/)*.*?\.\w{2,3}\//,"")
If it's needed, I could explain the regular expression.
link = "http://foobar.s3.amazonaws.com/uploads/users/15/photos/12/foo.jpg"
path = link.match(/\/\/[^\/]*\/(.*)/)
path[1]
#=> "uploads/users/15/photos/12/foo.jpg"
Someone recommended this approach as well:
URI.parse(URI.escape('http://foobar.s3.amazonaws.com/uploads/users/15/photos/12/foo.jpg')).path[1..-1]
(Note that URI.escape was removed in Ruby 3.0, so this line only runs on older Rubies.)
Are there any disadvantages using something like this versus a regexp approach?
The cheap answer is to just strip everything before the first single /.
Better answers are "How do I process a URL in ruby to extract the component parts (scheme, username, password, host, etc)?" and "Remove subdomain from string in ruby".
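One possible reading of the "cheap answer", sketched with a single sub call. The method name is invented, and "first single /" is taken to mean the first slash that is not part of the :// after the scheme.

```ruby
# Hypothetical helper: drop everything up to and including the first "/"
# that is neither preceded by ":" or "/" nor followed by another "/".
def strip_up_to_first_single_slash(url)
  url.sub(%r{\A.*?(?<![:/])/(?!/)}, '')
end

p strip_up_to_first_single_slash('http://foobar.s3.amazonaws.com/uploads/users/15/photos/12/foo.jpg')
```

A string with no such slash is returned unchanged, and any query string stays attached to the result, which is one reason the URI-based answers are usually the safer choice.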

Ruby regex: extract a list of urls from a string

I have a string of images' URLs and I need to convert it into an array.
http://rubular.com/r/E2a5v2hYnJ
How do I do this?
URI.extract(your_string)
That's all you need if you already have it in a string. Put require 'uri' in there first. Gotta love that standard library!
See the docs for URI.extract.
Scan returns an array
myarray = mystring.scan(/regex/)
See here on regular-expressions.info
The best answer will depend very much on exactly what input string you expect.
If your test string is accurate then I would not use a regex, do this instead (as suggested by Marnen Laibow-Koser):
mystring.split('?v=3')
If you really don't have constant fluff between your useful strings, then a regex might be better. Your regex is greedy. This will get you part way:
mystring.scan(/https?:\/\/[\w.\/-]*?\.(?:jpe?g|gif|png)/)
Note the '?' after the '*' in the part capturing the server and path pieces of the URL; this makes the regex non-greedy. The extension group is non-capturing, (?:...), because scan returns only the captured groups when the pattern contains any.
The problem with this is that if your server name or path contains any of .jpg, .jpeg, .gif or .png then the result will be wrong in that instance.
Figuring out what is best needs more information about your input string. You might for example find it better to pattern match the fluff between your desired URLs.
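To make the difference concrete, here is a short sketch on an invented input string; the real string from the question is not shown, so the URLs and the "?v=3" glue are assumptions. String#scan returns only the captured groups when the pattern has any, which is why the first pattern uses a non-capturing group.

```ruby
# Made-up input: image URLs glued together by a "?v=3" suffix.
mystring = 'http://img.example.com/a/one.jpg?v=3http://img.example.com/b/two.png?v=3'

# Non-greedy body, non-capturing extension group: scan returns whole URLs.
IMAGE_URL = %r{https?://[\w./-]*?\.(?:jpe?g|gif|png)}
urls = mystring.scan(IMAGE_URL)

# Same pattern with a capturing group: scan returns only the extensions.
extensions = mystring.scan(%r{https?://[\w./-]*?\.(jpe?g|gif|png)})

p urls
p extensions
```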
Use String#split (see the docs for details).
Part of the problem is that in Rubular you are using https instead of http. This gets you closer to what you want if the other answers don't work for you:
http://rubular.com/r/cIjmjxIfz5

Ruby RegEx issue

I'm having a problem getting my RegEx to work with my Ruby script.
Here is what I'm trying to match:
http://my.test.website.com/{GUID}/{GUID}/
Here is the RegEx that I've tested and should be matching the string as shown above:
/([-a-zA-Z0-9#:%_\+.~#?&\/\/=]{2,256}\.[a-z]{2,4}\b(\/[-a-zA-Z0-9#:%_\+.~#?&\/\/=]*)([\/\/[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\/\/[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\/\/])*?\/)/
3 capturing groups:
group 1: ([-a-zA-Z0-9#:%_\+.~#?&\/\/=]{2,256}\.[a-z]{2,4}\b(\/[-a-zA-Z0-9#:%_\+.~#?&\/\/=]*)([\/\/[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\/\/[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\/\/])*?\/)
group 2: (\/[-a-zA-Z0-9#:%_\+.~#?&\/\/=]*)
group 3: ([\/\/[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\/\/[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\/\/])
Ruby is giving me an error when trying to validate a match against this regex:
empty range in char class: (My RegEx goes here) (SyntaxError)
I appreciate any thoughts or suggestions on this.
You could simplify things a bit by using URI to deal with parsing the URL, \h in the regex, and scan to pull out the GUIDs:
require 'uri'
uri = URI.parse(your_url)
path = uri.path
guids = path.scan(/\h{8}-\h{4}-\h{4}-\h{4}-\h{12}/)
If you need any of the non-path components of the URL, then you can easily pull them out of uri.
You might need to tighten things up a bit depending on your data or it might be sufficient to check that guids has two elements.
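A runnable sketch of that approach, including the "two elements" check. The URL and both GUIDs are invented to match the shape given in the question.

```ruby
require 'uri'

# Made-up URL in the shape described in the question; both GUIDs are invented.
your_url = 'http://my.test.website.com/123e4567-e89b-12d3-a456-426614174000/9f8b7c6d-1a2b-4c3d-8e9f-0a1b2c3d4e5f/'

uri   = URI.parse(your_url)
# \h matches one hexadecimal digit, so this is the usual 8-4-4-4-12 layout.
guids = uri.path.scan(/\h{8}-\h{4}-\h{4}-\h{4}-\h{12}/)

puts uri.host
puts guids.length == 2 ? 'two GUIDs found' : 'unexpected shape'
```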
You have several errors in your RegEx. I am very sleepy now, so I'll just give you a hint instead of a solution:
...[\/\/[0-9a-fA-F]....
the first [ does not belong there. Also, having \/\/ inside [] is unnecessary - you only need each character once inside []. Also,
...[-a-zA-Z0-9#:%_\+.~#?&\/\/=]{2,256}...
is greedy, and includes a period - indeed, includes all chars (AFAICS) that can come after it, effectively swallowing the whole string (when you get rid of other bugs). Consider {2,256}? instead.
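If a single regex is still wanted, a shorter pattern along these lines might do. This is only a sketch of one guess at the intent (scheme, host, then exactly two GUIDs, each followed by a slash). Instead of making the quantifier lazy as hinted above, it keeps / out of the host character class so the greedy part cannot swallow the path. The constant names and sample GUIDs are made up.

```ruby
# One GUID in 8-4-4-4-12 form; \h is a single hexadecimal digit.
GUID_RE = /\h{8}-\h{4}-\h{4}-\h{4}-\h{12}/

# Scheme, host, then exactly two GUID path segments, anchored at both ends.
TWO_GUID_URL = %r{\Ahttps?://[-a-zA-Z0-9.]+\.[a-z]{2,4}/#{GUID_RE}/#{GUID_RE}/\z}

good = 'http://my.test.website.com/123e4567-e89b-12d3-a456-426614174000/9f8b7c6d-1a2b-4c3d-8e9f-0a1b2c3d4e5f/'
bad  = 'http://my.test.website.com/123e4567-e89b-12d3-a456-426614174000/'

p TWO_GUID_URL.match?(good)
p TWO_GUID_URL.match?(bad)
```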
