Remove all but the website name from URL in Ruby [duplicate]

I'm iterating through a list of URLs. The URLs come in different formats, like:
https://twitter.com/sdfaskj...
https://www.linkedin.com/asdkfjasd...
http://google.com/asdfjasdj...
etc.
I would like to use gsub or something similar to erase everything but the name of the website, so I get only "twitter", "linkedin", and "google", respectively.
In my head, ideally I would like something like a .gsub that can check for multiple possibilities (url.gsub("https:// or https://www. or http:// etc.", "")) and replace them with an empty string ("") when found. It also needs to delete everything after the name, i.e. ".com/wkadslflj..."
attributes.css("a").each do |attribute|
  attribute_url = attribute["href"]
  attribute_scrape = attribute_url.gsub("https://", "")
  binding.pry
end

I would consider a combination of URI.parse to get the hostname from the URL and the PublicSuffix gem to get the second level domain:
require 'public_suffix'
require 'uri'
url = 'https://www.linkedin.com/asdkfjasd'
host = URI.parse(url).host # => 'www.linkedin.com'
PublicSuffix.parse(host).sld # => 'linkedin'
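Applied to a list like the one in the question, that might look like this (a minimal sketch; the URLs and the names variable are just illustrative):
require 'public_suffix'
require 'uri'

# Illustrative URLs standing in for the scraped hrefs
urls = [
  'https://twitter.com/sdfaskj',
  'https://www.linkedin.com/asdkfjasd',
  'http://google.com/asdfjasdj'
]

names = urls.map do |url|
  host = URI.parse(url).host   # e.g. "www.linkedin.com"
  PublicSuffix.parse(host).sld # e.g. "linkedin"
end
names # => ["twitter", "linkedin", "google"]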

You can use this gsub regexp (note the escaped dots):
gsub(/https?:\/\/(www\.)?|\.(com|net|co\.uk|us).*/, '')
Output:
list = ["https://twitter.com/sdfaskj...", "https://www.linkedin.com/asdkfjasd...", "http://google.com/asdfjasdj..."]
list.map { |u| u.gsub(/https?:\/\/(www\.)?|\.(com|net|co\.uk|us).*/, '') }
=> ["twitter", "linkedin", "google"]

Correct User's Inputted URL - Ruby [duplicate]

I have a URL input box, but users still don't always enter the URL correctly, even with an example shown, and the field still allows a few variations. I was wondering how I could turn a user-entered URL into the format I want. Say I want the following format in the end:
http://www.example.com/
but the user enters in one of the following
www.example.com
www.example.com/
http://www.example.com
The other case is when they don't use a subdomain, so the end result needs to be:
http://example.com
and they type in either:
example.com
example.com/
http://example.com
The code should be able to correct any of these formatting mistakes and produce the desired format.
def correct(url, protocol = 'http')
  url = url.sub(%r{^https?://}, '')
  protocol = $& || "#{protocol}://"
  # url = url.chomp('/') + '/' # Uncomment to ensure the URL ends with '/'
  url = 'www.' + url unless url.start_with? 'www.'
  protocol + url
end

correct('www.example.com')        # "http://www.example.com"
correct('www.example.com/')       # "http://www.example.com/"
correct('http://www.example.com') # "http://www.example.com"
correct('example.com')            # "http://www.example.com"
correct(' example.com/')          # "http://www. example.com/"
correct('http://example.com')     # "http://www.example.com"
correct('https://example.com')    # "https://www.example.com"
I had the same problem; here is my solution in Rails.
In application_helper.rb:
def address(link)
  return link if link.scan(/(https:\/\/)|(http:\/\/)/).any?
  link.split('').unshift('http://').join('')
end
In a view:
= link_to "", address(user.website)
In pry:
helper.address("http://www.example.com/")
=> "http://www.example.com/"
helper.address("example.com")
=> "http://example.com"
You should use the domainatrix gem:
require 'domainatrix'
str = 'example.com'
url = Domainatrix.parse(str).url #=> "http://example.com"

What is a regex to check to see if some text contains only URLs?

I'm trying to make a regular expression that checks whether some text contains only URLs and whitespace and nothing else, so:
http://www.google.com http://www.stackoverflow.com
would match, but:
http://www.google.com and http://www.stackoverflow.com
would not match.
Is this possible?
You can use this regex (it only tests that everything between spaces begins with http:// or https://):
/^(?:https?:\/\/\S++\s*+)++$/ =~ text
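For example, a quick check against the question's two strings (Ruby's regexp engine supports the possessive quantifiers used here):
only_urls = /^(?:https?:\/\/\S++\s*+)++$/

only_urls.match?("http://www.google.com http://www.stackoverflow.com")     # => true
only_urls.match?("http://www.google.com and http://www.stackoverflow.com") # => false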
Ruby already has a method to extract URLs, so that's a great starting place, rather than reinventing a working wheel:
require 'uri'

[
  'http://www.google.com http://www.stackoverflow.com',
  'http://www.google.com and http://www.stackoverflow.com'
].each do |url|
  print url
  if url.split.all? { |u| !URI.extract(u).empty? }
    puts " contains only URLs"
  else
    puts " doesn't contain only URLs"
  end
end
Which, after running, is:
http://www.google.com http://www.stackoverflow.com contains only URLs
http://www.google.com and http://www.stackoverflow.com doesn't contain only URLs
This doesn't support all the recognized URL schemes, but it is a starting point. You can specify which you want by passing an array of schemes to extract. You can get the IANA's permanent list using:
require 'open-uri'
require 'nokogiri'
doc = Nokogiri::HTML(URI.open('http://www.iana.org/assignments/uri-schemes.html')) # use URI.open; Kernel#open no longer opens URLs on Ruby 3+
schemes = doc.at('table table').search('tr').map{ |tr| tr.at('td').text }[1..-1]
A simpler (and looser) check is to split on whitespace and test each word against a pattern:
words.split.all? { |word| word.match(/^http:/) }
This will check for any URL; the string should contain only URLs separated by single whitespace characters:
(((mailto\:|(news|(ht|f)tp(s?))\://){1}\S+)\s){1,}((mailto\:|(news|(ht|f)tp(s?))\://){1}\S+)$
Reference:
http://www.regular-expressions.info/reference.html
http://regexlib.com/Search.aspx?k=URL&AspxAutoDetectCookieSupport=1
If you really want to use a regex, please try this:
(?<protocol>\w+):\/\/(?<domain>[\w#][\w.:#]+)\/?[\w\.?=%&=\-#/$,]*
Split the string on whitespace, and check whether each piece matches the regex above.
Hope it helps!
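A sketch of that split-and-check idea, using an anchored and slightly tidied version of the pattern above (the helper name only_urls? and the cleaned-up character class are mine, not from the original answer):
url_pattern = /\A(?<protocol>\w+):\/\/(?<domain>[\w#][\w.:#]+)\/?[\w.?=%&\-#\/$,]*\z/

def only_urls?(text, pattern)
  text.split.all? { |token| token.match?(pattern) }
end

only_urls?("http://www.google.com http://www.stackoverflow.com", url_pattern)     # => true
only_urls?("http://www.google.com and http://www.stackoverflow.com", url_pattern) # => false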

How to parse a URL and extract the required substring

Say I have a string like this: "http://something.example.com/directory/"
What I want to do is to parse this string, and extract the "something" from the string.
The first step is obviously to check that the string contains "http://" - otherwise, it should ignore the string.
But how do I then extract just the "something" from that string? Assume that all the strings being evaluated have a similar structure (i.e. I am trying to extract the subdomain of the URL - if the string being examined is indeed a valid URL - where valid means it starts with "http://").
Thanks.
P.S. I know how to check the first part, i.e. I can simply split the string at the "http://", but that doesn't solve the full problem because it leaves me with "something.example.com/directory/". All I want is the "something", nothing else.
I'd do it this way:
require 'uri'
uri = URI.parse('http://something.example.com/directory/')
uri.host.split('.').first
=> "something"
URI is built into Ruby. It's not the most full-featured but it's plenty capable of doing this task for most URLs. If you have IRIs then look at Addressable::URI.
You could use URI like
uri = URI.parse("http://something.example.com/directory/")
puts uri.host
# "something.example.com"
and you could then just work on the host.
Or there is the domainatrix gem, from "Remove subdomain from string in ruby":
require 'rubygems'
require 'domainatrix'
url = Domainatrix.parse("http://foo.bar.pauldix.co.uk/asdf.html?q=arg")
url.public_suffix # => "co.uk"
url.domain # => "pauldix"
url.subdomain # => "foo.bar"
url.path # => "/asdf.html?q=arg"
url.canonical # => "uk.co.pauldix.bar.foo/asdf.html?q=arg"
and you could just take the subdomain.
Well, you can use regular expressions.
Something like /http:\/\/([^\.]+)/, that is, capture the first group of non-'.' characters after "http://".
Check out http://rubular.com/. You can test your regular expressions against a set of examples there; it's great for learning this tool.
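For example, using String#[] with a capture group index (a small sketch of that idea):
url = 'http://something.example.com/directory/'
url[/http:\/\/([^.]+)/, 1] # => "something"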
With URI.parse you can get:
require "uri"
uri = URI.parse("http://localhost:3000")
uri.scheme # http
uri.host # localhost
uri.port # 3000

Extract all urls inside a string in Ruby

I have some text content with a list of URLs contained in it.
I am trying to grab all the URLs out and put them in an array.
I have this code
content = "Here is the list of URLs: http://www.google.com http://www.google.com/index.html"
urls = content.scan(/^(http|https):\/\/[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(([0-9]{1,5})?\/.*)?$/ix)
I am trying to get the end results to be:
['http://www.google.com', 'http://www.google.com/index.html']
The above code does not seem to be working correctly. Does anyone know what I am doing wrong?
Thanks
Easy:
require 'uri'
URI.extract(content, ['http', 'https'])
=> ["http://www.google.com", "http://www.google.com/index.html"]
A different approach, from the perfect-is-the-enemy-of-the-good school of thought:
urls = content.split(/\s+/).find_all { |u| u =~ /^https?:/ }
I haven't checked the syntax of your regex, but String#scan will produce an array, each of whose members is an array of the groups matched by your regex. So I'd expect the result to be something like:
[['http', '.google.com'], ...]
You'll need non-capturing groups /(?:stuff)/ if you want the format you've given.
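For example (with illustrative strings, not the question's content):
"http://a.com http://b.com".scan(/(http|https):\/\/(\S+)/)
# => [["http", "a.com"], ["http", "b.com"]]   (only the capture groups)
"http://a.com http://b.com".scan(/(?:http|https):\/\/\S+/)
# => ["http://a.com", "http://b.com"]         (the whole matches)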
Edit (looking at regex): Also, your regex does look a bit wrong. You don't want the start and end anchors (^ and $), since you don't expect the matches to be at start and end of content. Secondly, if your ([0-9]{1,5})? is trying to capture a port number, I think you're missing a colon to separate the domain from the port.
Further edit, after playing: I think you want something like this:
content = "Here is the list of URLs: http://www.google.com http://www.google.com/index.html http://example.com:3000/foo"
urls = content.scan(/(?:http|https):\/\/[a-z0-9]+(?:[\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(?:(?::[0-9]{1,5})?\/[^\s]*)?/ix)
# => ["http://www.google.com", "http://www.google.com/index.html", "http://example.com:3000/foo"]
... but note that it won't match pure IP-address URLs (like http://127.0.0.1), because of the [a-z]{2,5} for the TLD.
Just for your interest:
Ruby has a URI module, which has a regexp implemented to do such things:
require "uri"

urls = []
schemes_you_want_to_grab = ['ftp', 'http', 'https', 'mailto']
# html_string is the text you are scanning
html_string.scan(URI.regexp(schemes_you_want_to_grab)) do |*matches|
  urls << $&
end
For more information visit the Ruby Ref: URI
The most upvoted answer was causing issues with Markdown URLs for me, so I had to figure out a regex to extract URLs. Below is what I use:
URL_REGEX = /(https?:\/\/\S+?)(?:[\s)]|$)/i
content.scan(URL_REGEX).flatten
The last part here (?:[\s)]|$) is used to identify the end of the URL and you can add characters there as per your need and content. Right now it looks for any space characters, closing bracket or end of string.
content = "link in text [link1](http://www.example.com/test) and [link2](http://www.example.com/test2)
http://www.example.com/test3
http://www.example.com/test4"
returns ["http://www.example.com/test", "http://www.example.com/test2", "http://www.example.com/test3", "http://www.example.com/test4"].

How to check if a URL is valid

How can I check if a string is a valid URL?
For example:
http://hello.it => yes
http:||bra.ziz, => no
If it is a valid URL, how can I check whether it refers to an image file?
Notice:
As pointed out by @CGuess, there's a bug with this approach; it has been documented for over 9 years now that validation is not the purpose of this regular expression (see https://bugs.ruby-lang.org/issues/6520).
Use the URI module distributed with Ruby:
require 'uri'
if url =~ URI::regexp
  # Correct URL
end
Like Alexander Günther said in the comments, it checks if a string contains a URL.
To check if the string is a URL, use:
url =~ /\A#{URI::regexp}\z/
If you only want to check for web URLs (http or https), use this:
url =~ /\A#{URI::regexp(['http', 'https'])}\z/
Similar to the answers above, I find using this regex to be slightly more accurate:
URI::DEFAULT_PARSER.regexp[:ABS_URI]
That will invalidate URLs with spaces, as opposed to URI.regexp which allows spaces for some reason.
I have recently found a shortcut that is provided for the different URI regexps. You can access any of URI::DEFAULT_PARSER.regexp.keys directly from URI::#{key}.
For example, the :ABS_URI regexp can be accessed from URI::ABS_URI.
The problem with the current answers is that a URI is not necessarily a URL.
A URI can be further classified as a locator, a name, or both. The
term "Uniform Resource Locator" (URL) refers to the subset of URIs
that, in addition to identifying a resource, provide a means of
locating the resource by describing its primary access mechanism
(e.g., its network "location").
Since URLs are a subset of URIs, it is clear that matching specifically for URIs will successfully match undesired values. For example, URNs:
"urn:isbn:0451450523" =~ URI::regexp
=> 0
That being said, as far as I know, Ruby doesn't have a default way to parse URLs specifically, so you'll most likely need a gem to do so. If you need to match URLs specifically in HTTP or HTTPS format, you could do something like this:
uri = URI.parse(my_possible_url)
if uri.kind_of?(URI::HTTP) or uri.kind_of?(URI::HTTPS)
  # do your stuff
end
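Note that URI.parse raises URI::InvalidURIError for strings it cannot parse, so in practice you would wrap this in a rescue. A minimal sketch (the method name http_url? is mine):
require 'uri'

def http_url?(string)
  uri = URI.parse(string)
  uri.kind_of?(URI::HTTP) || uri.kind_of?(URI::HTTPS)
rescue URI::InvalidURIError
  false
end

http_url?("http://hello.it") # => true
http_url?("http:||bra.ziz")  # => false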
I prefer the Addressable gem. I have found that it handles URLs more intelligently.
require 'addressable/uri'

SCHEMES = %w(http https)

def valid_url?(url)
  parsed = Addressable::URI.parse(url) or return false
  SCHEMES.include?(parsed.scheme)
rescue Addressable::URI::InvalidURIError
  false
end
This is a fairly old entry, but I thought I'd go ahead and contribute:
String.class_eval do
  def is_valid_url?
    uri = URI.parse self
    uri.kind_of? URI::HTTP
  rescue URI::InvalidURIError
    false
  end
end
Now you can do something like:
if "http://www.omg.wtf".is_valid_url?
p "huzzah!"
end
For me, I use this regular expression:
/\A(http|https):\/\/[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(:[0-9]{1,5})?(\/.*)?\z/ix
Option:
i - case insensitive
x - ignore whitespace in regex
You can set this method to check URL validation:
def valid_url?(url)
return false if url.include?("<script")
url_regexp = /\A(http|https):\/\/[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(:[0-9]{1,5})?(\/.*)?\z/ix
url =~ url_regexp ? true : false
end
To use it:
valid_url?("http://stackoverflow.com/questions/1805761/check-if-url-is-valid-ruby")
Testing with wrong URLs:
http://ruby3arabi - result is invalid
http://http://ruby3arabi.com - result is invalid
http:// - result is invalid
http://test.com\n<script src=\"nasty.js\"> (Just simply check "<script")
127.0.0.1 - result is invalid (IP addresses are not supported)
Test with correct URLs:
http://ruby3arabi.com - result is valid
http://www.ruby3arabi.com - result is valid
https://www.ruby3arabi.com - result is valid
https://www.ruby3arabi.com/article/1 - result is valid
https://www.ruby3arabi.com/websites/58e212ff6d275e4bf9000000?locale=en - result is valid
In general,
/^#{URI::regexp}$/
will work well, but if you only want to match http or https, you can pass those in as options to the method:
/^#{URI::regexp(%w(http https))}$/
That tends to work a little better, if you want to reject protocols like ftp://.
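For example (a quick sketch; the variable name is just illustrative):
require 'uri'

web_only = /^#{URI::regexp(%w(http https))}$/

"http://hello.it" =~ web_only # => 0 (match)
"ftp://hello.it" =~ web_only  # => nil (rejected)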
This is a little bit old, but here is how I do it: use Ruby's URI module to parse the URL. If it can be parsed, then it's a valid URL. (But that doesn't mean it's accessible.)
URI supports many schemes, plus you can add custom schemes yourself:
irb> uri = URI.parse "http://hello.it" rescue nil
=> #<URI::HTTP:0x10755c50 URL:http://hello.it>
irb> uri.instance_values
=> {"fragment"=>nil,
"registry"=>nil,
"scheme"=>"http",
"query"=>nil,
"port"=>80,
"path"=>"",
"host"=>"hello.it",
"password"=>nil,
"user"=>nil,
"opaque"=>nil}
irb> uri = URI.parse "http:||bra.ziz" rescue nil
=> nil
irb> uri = URI.parse "ssh://hello.it:5888" rescue nil
=> #<URI::Generic:0x105fe938 URL:ssh://hello.it:5888>
irb> uri.instance_values
=> {"fragment"=>nil,
"registry"=>nil,
"scheme"=>"ssh",
"query"=>nil,
"port"=>5888,
"path"=>"",
"host"=>"hello.it",
"password"=>nil,
"user"=>nil,
"opaque"=>nil}
See the documentation for more information about the URI module.
You could also use a regex, maybe something like the one at http://www.geekzilla.co.uk/View2D3B0109-C1B2-4B4E-BFFD-E8088CBC85FD.htm. Assuming that regex is correct (I haven't fully checked it), the following will show the validity of the URL (written as a regexp literal so the backslash escapes survive):
url_regex = %r{((https?|ftp|file):((//)|(\\\\))+[\w\d:#%/;$()~_?+\-=\\.&]*)}
urls = [
  "http://hello.it",
  "http:||bra.ziz"
]

urls.each do |url|
  if url =~ url_regex
    puts "%s is valid" % url
  else
    puts "%s not valid" % url
  end
end
The above example outputs:
http://hello.it is valid
http:||bra.ziz not valid
