Correct User's Inputted URL - Ruby [duplicate] - ruby

This question already has answers here:
Make user-inputted url into external link in rails
(2 answers)
Closed 8 years ago.
I have a URL input box, but users don't always enter the URL in the format shown in the example, and the field still accepts several variations. How can I turn a user-entered URL into the format I want? Say I want the following format in the end:
http://www.example.com/
but the user enters in one of the following
www.example.com
www.example.com/
http://www.example.com
The other way would be if they didn't use a subdomain so the end result needs to be:
http://example.com
and they type in either:
example.com
example.com/
http://example.com
The code should correct any of these formatting mistakes and produce the desired format.

def correct(url, protocol = 'http')
  # Strip an existing http:// or https:// prefix; $& holds the matched prefix, or nil if there was none
  url = url.sub(%r{^https?://}, '')
  protocol = $& || "#{protocol}://"
  # url = url.chomp('/') + '/' # Uncomment to ensure the URL ends with `/`
  url = 'www.' + url unless url.start_with?('www.')
  protocol + url
end
correct('www.example.com') # "http://www.example.com"
correct('www.example.com/') # "http://www.example.com/"
correct('http://www.example.com') # "http://www.example.com"
correct('example.com') # "http://www.example.com"
correct('example.com/') # "http://www.example.com/"
correct('http://example.com') # "http://www.example.com"
correct('https://example.com') # "https://www.example.com"
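Note that the method above always adds www., so it does not cover the question's second case, where example.com should stay without a subdomain. Here is a minimal sketch of an alternative, assuming only Ruby's standard uri library. It trims surrounding whitespace, adds a default scheme when none is given, and ends a bare host with /. The name normalize_url and the default_scheme parameter are made up for this illustration.

```ruby
require 'uri'

# Hypothetical helper (not from the answer above): returns a normalized URL
# string, or nil when the input cannot be parsed or has no host.
def normalize_url(input, default_scheme: 'http')
  text = input.to_s.strip
  # Prepend the default scheme unless the input already starts with http(s)://
  text = "#{default_scheme}://#{text}" unless text.match?(%r{\Ahttps?://}i)
  uri = URI.parse(text)
  return nil if uri.host.nil? || uri.host.empty?

  # A bare host gets the root path so every result ends the same way
  uri.path = '/' if uri.path.empty?
  uri.to_s
rescue URI::InvalidURIError
  nil
end

normalize_url('www.example.com')  # => "http://www.example.com/"
normalize_url(' example.com/')    # => "http://example.com/"
```

Whether the result should end with / is a design choice; drop the uri.path line if a bare host should stay bare.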

I had the same problem; here is my solution in Rails.
in application_helper.rb:
def address(link)
  # Leave the link alone if it already starts with http:// or https://
  return link if link.match?(%r{\Ahttps?://})
  "http://#{link}"
end
in views:
= link_to "", address(user.website)
in pry:
> helper.address("http://www.example.com/")
=> "http://www.example.com/"
> helper.address("example.com")
=> "http://example.com"

You could use the domainatrix gem:
require 'domainatrix'
str = 'example.com'
url = Domainatrix.parse(str).url #=> "http://example.com"

Related

Remove all but the website name from URL in Ruby [duplicate]

This question already has answers here:
How to parse a URL and extract the required substring
(4 answers)
Closed 5 years ago.
I am iterating through a list of URLs. The URLs come in different formats, like:
https://twitter.com/sdfaskj...
https://www.linkedin.com/asdkfjasd...
http://google.com/asdfjasdj...
etc.
I would like to use gsub or something similar to erase everything but the name of the website, to get only "twitter", "linkedin", and "google", respectively.
Ideally, I would like something like a .gsub that can check for multiple possibilities (url.gsub("https:// or https://www. or http:// etc.", "")) and replace whichever is found with the empty string "". It also needs to delete everything after the name, such as ".com/wkadslflj...".
attributes.css("a").each do |attribute|
  attribute_url = attribute["href"]
  attribute_scrape = attribute_url.gsub("https://", "")
  binding.pry
end
I would consider a combination of URI.parse to get the hostname from the URL and the PublicSuffix gem to get the second level domain:
require 'public_suffix'
require 'uri'
url = 'https://www.linkedin.com/asdkfjasd'
host = URI.parse(url).host # => 'www.linkedin.com'
PublicSuffix.parse(host).sld # => 'linkedin'
You can use this gsub regexp (the dots are escaped so that they match only a literal "."):
gsub(/https?:\/\/(www\.)?|\.(com|net|co\.uk|us).*/, '')
Output:
list = ["https://twitter.com/sdfaskj...", "https://www.linkedin.com/asdkfjasd...", "http://google.com/asdfjasdj..."]
list.map { |u| u.gsub(/https?:\/\/(www\.)?|\.(com|net|co\.uk|us).*/, '') }
=> ["twitter", "linkedin", "google"]
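If adding a gem is not an option, a rough standard-library sketch is to take the host from URI.parse, drop a leading www., and keep the first label. This is a shortcut under stated assumptions: site_name is a hypothetical helper, and it would return "mail" for mail.google.com, which is exactly the case PublicSuffix handles properly.

```ruby
require 'uri'

# Hypothetical helper: first host label after an optional "www." prefix.
# Returns nil when the string has no host or cannot be parsed.
def site_name(url)
  host = URI.parse(url).host
  return nil if host.nil?

  host.sub(/\Awww\./i, '').split('.').first
rescue URI::InvalidURIError
  nil
end

urls = ['https://twitter.com/sdfaskj',
        'https://www.linkedin.com/asdkfjasd',
        'http://google.com/asdfjasdj']
urls.map { |u| site_name(u) }  # => ["twitter", "linkedin", "google"]
```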

Converting to valid urls which can be opened by open-uri

I need to open some webpages using open-uri in Ruby and then parse the content of those pages using Nokogiri.
I just did:
require 'open-uri'
content_file = open(user_input_url)
This worked for http://www.google.co.in and http://google.co.in, but it fails when the user gives input like www.google.co.in or google.co.in.
One thing I can do for such input is prepend http:// or https:// and return the content of the page that opens, but this seems like a big hack to me.
Is there a better way to achieve this in Ruby (i.e. converting this user input into valid open-uri URLs)?
uri = URI("www.google.com")
if uri.instance_of?(URI::Generic)
  uri = URI::HTTP.build({:host => uri.to_s})
end
content_file = open(uri)
There are other ways as well; see http://www.ruby-doc.org/stdlib-2.0.0/libdoc/uri/rdoc/URI/HTTP.html
Prepend the scheme if not present and then use URI which will check the URL validity:
require 'uri'
url = 'www.google.com/a/b?c=d#e'
url.prepend "http://" unless url.start_with?('http://', 'https://')
url = URI(url) # raises an error if the URL is not valid
open url
Unfortunately, an "object oriented" version of what you need is more verbose and even more hackish:
require 'uri'
url = URI.parse('www.google.com/a/b?c=d#e')
case url
when URI::HTTP, URI::HTTPS
  # no-op
when URI::Generic
  # We need to split url.path at the first '/', since URI::Generic interprets
  # 'www.google.com/a/b' as a single path
  host, path = url.path.split('/', 2)
  url = URI::HTTP.build(host: host,
                        path: "/#{path}",
                        query: url.query,
                        fragment: url.fragment)
else
  raise "unsupported url class (#{url.class}) for #{url}"
end
open url
A suggestion: don't rack your brain over this. I have run into this problem often, and I'm fairly sure there is no "polished" way to do it.
You need to prepend http:// to the URLs. Without an explicit scheme the URI could be anything, e.g. a local file; a URI is not necessarily an HTTP URL.
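To see what the missing scheme does in practice, here is a short sketch using only the standard uri library (the variable names bare and full are just for illustration). Without a scheme the whole input ends up in the path and there is no host at all:

```ruby
require 'uri'

bare = URI.parse('www.google.com/a/b')
bare.class   # => URI::Generic
bare.scheme  # => nil
bare.host    # => nil
bare.path    # => "www.google.com/a/b"

# With the scheme prepended, the same text parses as an HTTP URL
full = URI.parse("http://#{bare}")
full.class   # => URI::HTTP
full.host    # => "www.google.com"
full.path    # => "/a/b"
```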
You can check either by using the URI class or by using a regex:
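# Using the URI class:
user_input_url = URI.parse(user_input_url).scheme ?
  user_input_url :
  "http://#{user_input_url}"
# Using a regex (anchored so the scheme must be at the start of the string):
user_input_url = user_input_url =~ /\Ahttps?:\/\// ?
  user_input_url :
  "http://#{user_input_url}"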
def instance_to_hash(instance)
  hash = {}
  instance.instance_variables.each { |var| hash[var[1..-1].to_sym] = instance.instance_variable_get(var) }
  hash
end

def url_compile(url)
  # If the URL does not start with 'http://', 'https://' or '//',
  # prepend '//'
  url.prepend '//' unless url.start_with?('http://', 'https://', '//')
  uri = URI(url)
  if uri.instance_of?(URI::Generic) # if the scheme is nil, assume HTTPS
    uri = URI::HTTPS.build(instance_to_hash(uri))
  end
  uri
end

How do I follow URL redirection?

I have a URL and I need to retrieve the URL it redirects to (the number of redirections is arbitrary).
One real example I'm working on is:
https://www.google.com/url?q=http://m.zynga.com/about/privacy-center/privacy-policy&sa=D&usg=AFQjCNESJyXBeZenALhKWb52N1vHouAd5Q
which will eventually redirect to:
http://company.zynga.com/privacy/policy
which is the URL I'm interested in.
I tried with open-uri as follows:
privacy_url = "https://www.google.com/url?q=http://m.zynga.com/about/privacy-center/privacy-policy&sa=D&usg=AFQjCNESJyXBeZenALhKWb52N1vHouAd5Q"
final_url = nil
open(privacy_url) do |h|
  puts "Redirecting to #{h.base_uri}"
  final_url = h.base_uri
end
but I keep getting the original URL back, meaning that final_url is equal to privacy_url.
Is there any way to follow this kind of redirection and programmatically access the resulting URL?
I finally made it work using the Mechanize gem. The key is to enable the follow_meta_refresh option, which is disabled by default.
Here's how:
require 'mechanize'
browser = Mechanize.new
browser.follow_meta_refresh = true
start_url = "https://www.google.com/url?q=http://m.zynga.com/about/privacy-center/privacy-policy&sa=D&usg=AFQjCNESJyXBeZenALhKWb52N1vHouAd5Q"
final_url = nil
browser.get(start_url) do |page|
  final_url = page.uri.to_s
end
puts final_url # => http://company.zynga.com/privacy/policy
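For comparison, ordinary HTTP redirects (3xx responses with a Location header) can be followed with the standard library alone. The sketch below is an illustration under assumptions, not a replacement for the Mechanize solution: it does not handle meta refresh pages, which is what the link in the question needed. resolve_redirects, HTTP_LOCATION and the locate: parameter are names invented here; locate: exists only so the loop can be tried without network access.

```ruby
require 'net/http'
require 'uri'

# Default lookup: issue one GET and return the Location header of a 3xx
# response, or nil when the response is not a redirect.
HTTP_LOCATION = lambda do |uri|
  response = Net::HTTP.get_response(uri)
  response.is_a?(Net::HTTPRedirection) ? response['location'] : nil
end

# Follows Location headers until a non-redirect response is reached,
# making at most `limit` lookups before giving up.
def resolve_redirects(start_url, limit: 10, locate: HTTP_LOCATION)
  uri = URI.parse(start_url)
  limit.times do
    location = locate.call(uri)
    return uri.to_s if location.nil?

    uri = uri.merge(location) # Location may be relative to the current URL
  end
  raise "gave up after #{limit} redirects"
end

# Offline usage with a stubbed lookup table instead of real requests
hops = { 'http://a.example/start' => '/next',
         'http://a.example/next'  => 'http://b.example/end' }
resolve_redirects('http://a.example/start', locate: ->(uri) { hops[uri.to_s] })
# => "http://b.example/end"
```

The limit guards against redirect loops; with the default lookup, a plain call such as resolve_redirects(start_url) performs real requests.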

How Do I search Twitter for a word with Ruby?

I have written code in Ruby that displays the timeline for a specific user. I would like to search Twitter to find every user who has mentioned a word. My code is currently:
require 'rubygems'
require 'oauth'
require 'json'
# Now you will fetch /1.1/statuses/user_timeline.json,
# returns a list of public Tweets from the specified
# account.
baseurl = "https://api.twitter.com"
path = "/1.1/statuses/user_timeline.json"
query = URI.encode_www_form(
  "q" => "Obama"
)
address = URI("#{baseurl}#{path}?#{query}")
request = Net::HTTP::Get.new address.request_uri
# Print data about a list of Tweets
def print_timeline(tweets)
  tweets.each do |tweet|
    require 'date'
    d = DateTime.parse(tweet['created_at'])
    puts " #{tweet['text'].delete ","} , #{d.strftime('%d.%m.%y')} , #{tweet['user']['name']}, #{tweet['id']}"
  end
end
# Set up HTTP.
http = Net::HTTP.new address.host, address.port
http.use_ssl = true
http.verify_mode = OpenSSL::SSL::VERIFY_PEER
# If you entered your credentials in the first
# exercise, no need to enter them again here. The
# ||= operator will only assign these values if
# they are not already set.
consumer_key = OAuth::Consumer.new(
"")
access_token = OAuth::Token.new(
"")
# Issue the request.
request.oauth! http, consumer_key, access_token
http.start
response = http.request request
# Parse and print the Tweet if the response code was 200
tweets = nil
puts "Text,Date,Name,id"
if response.code == '200' then
  tweets = JSON.parse(response.body)
  print_timeline(tweets)
end
nil
How would I possibly change this code to search all of twitter for a specific word?
The easiest approach would be to use the 'Twitter' gem. Refer to this link for more information and the result type of the search results. Once you have all the correct authorization attributes in place (OAuth token, OAuth secret, etc.), you should be able to search as
Twitter.search('Obama')
or
Twitter.search('Obama', options = {})
Let us know, if that worked for you or not.
p.s. - Please mark the post as answered if it helped you. Else put a comment back with what is missing.
The Twitter API documentation suggests the URI you should be using for global search is https://api.twitter.com/1.1/search/tweets.json, which means:
Your base_url component would be https://api.twitter.com
Your path component would be /1.1/search/tweets.json
Your query component would be the text you are searching for.
The query part accepts many parameters, as described in the API spec. Refer to the specification and change it to suit your requirements.
Tip: Try using the irb REPL (I'd recommend pry), which makes it a lot easier to explore APIs. Also, check out the Faraday gem, which can be easier to use than the default HTTP library in Ruby, IMO.
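Applied to the script in the question, the change described above amounts to building the address from the search path instead of the timeline path. A minimal sketch, assuming the 1.1 endpoint quoted above; search_address is a hypothetical helper name, and the OAuth signing and Net::HTTP setup from the question are left out:

```ruby
require 'uri'

# Hypothetical helper: builds the search address from the base URL, the
# search path quoted above, and a form-encoded "q" parameter.
def search_address(term, baseurl: 'https://api.twitter.com')
  path  = '/1.1/search/tweets.json'
  query = URI.encode_www_form('q' => term)
  URI("#{baseurl}#{path}?#{query}")
end

address = search_address('Obama')
address.host         # => "api.twitter.com"
address.request_uri  # => "/1.1/search/tweets.json?q=Obama"
```

If I remember the 1.1 response format correctly, the search results come back wrapped in an object with the tweets under a "statuses" key, so print_timeline would need JSON.parse(response.body)['statuses'] rather than the parsed body itself. Treat that as an assumption to check against the API documentation.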

How to check if a URL is valid

How can I check if a string is a valid URL?
For example:
http://hello.it => yes
http:||bra.ziz, => no
If it is a valid URL, how can I check whether it points to an image file?
Notice:
As pointed out by @CGuess, there's a bug with this issue, and it has been documented for over 9 years now that validation is not the purpose of this regular expression (see https://bugs.ruby-lang.org/issues/6520).
Use the URI module distributed with Ruby:
require 'uri'
if url =~ URI::regexp
  # Correct URL
end
Like Alexander Günther said in the comments, it checks if a string contains a URL.
To check if the string is a URL, use:
url =~ /\A#{URI::regexp}\z/
If you only want to check for web URLs (http or https), use this:
url =~ /\A#{URI::regexp(['http', 'https'])}\z/
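The difference between "contains a URL" and "is a URL" comes down to the anchors. A small sketch with hand-written patterns makes it visible; these patterns are simplified stand-ins chosen for illustration, not URI::regexp itself:

```ruby
# Simplified patterns used only to show the effect of the anchors
CONTAINS_URL = %r{https?://\S+}
IS_URL       = %r{\Ahttps?://\S+\z}
LINE_IS_URL  = %r{^https?://\S+$}

text  = 'see http://hello.it for details'
multi = "javascript:alert(1)\nhttp://hello.it"

CONTAINS_URL.match?(text)         # => true  (a URL appears somewhere)
IS_URL.match?(text)               # => false (the whole string is not a URL)
LINE_IS_URL.match?(multi)         # => true  (^ and $ match around line breaks)
IS_URL.match?(multi)              # => false (\A and \z cover the whole string)
IS_URL.match?('http://hello.it')  # => true
```

This is why \A and \z are the safer choice for validating user input: with ^ and $, anything can precede the URL as long as it sits on its own line.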
Similar to the answers above, I find using this regex to be slightly more accurate:
URI::DEFAULT_PARSER.regexp[:ABS_URI]
That will invalidate URLs with spaces, as opposed to URI.regexp which allows spaces for some reason.
I have recently found a shortcut that is provided for the different URI regexps. You can access any of URI::DEFAULT_PARSER.regexp.keys directly from URI::#{key}.
For example, the :ABS_URI regexp can be accessed from URI::ABS_URI.
The problem with the current answers is that a URI is not necessarily a URL.
A URI can be further classified as a locator, a name, or both. The
term "Uniform Resource Locator" (URL) refers to the subset of URIs
that, in addition to identifying a resource, provide a means of
locating the resource by describing its primary access mechanism
(e.g., its network "location").
Since URLs are a subset of URIs, it is clear that matching specifically for URIs will successfully match undesired values. For example, URNs:
"urn:isbn:0451450523" =~ URI::regexp
=> 0
That being said, as far as I know, Ruby doesn't have a default way to parse URLs, so you'll most likely need a gem to do so. If you need to match URLs specifically in HTTP or HTTPS format, you could do something like this:
uri = URI.parse(my_possible_url)
if uri.kind_of?(URI::HTTP) or uri.kind_of?(URI::HTTPS)
  # do your stuff
end
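URI.parse raises URI::InvalidURIError on input such as http:||bra.ziz, so the check above is usually wrapped in a method that rescues it. A minimal sketch using only the standard library; http_url? is a hypothetical name, and the extra host check is my own addition to reject strings that have a scheme but no host:

```ruby
require 'uri'

# Hypothetical helper: true only for parseable http/https URLs with a host.
# URI::HTTPS is a subclass of URI::HTTP, so one is_a? check covers both.
def http_url?(string)
  uri = URI.parse(string)
  uri.is_a?(URI::HTTP) && !uri.host.nil? && !uri.host.empty?
rescue URI::InvalidURIError
  false
end

http_url?('http://hello.it')      # => true
http_url?('http:||bra.ziz')       # => false
http_url?('urn:isbn:0451450523')  # => false
```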
I prefer the Addressable gem. I have found that it handles URLs more intelligently.
require 'addressable/uri'
SCHEMES = %w(http https)
def valid_url?(url)
  parsed = Addressable::URI.parse(url) or return false
  SCHEMES.include?(parsed.scheme)
rescue Addressable::URI::InvalidURIError
  false
end
This is a fairly old entry, but I thought I'd go ahead and contribute:
String.class_eval do
  def is_valid_url?
    uri = URI.parse self
    uri.kind_of? URI::HTTP
  rescue URI::InvalidURIError
    false
  end
end
Now you can do something like:
if "http://www.omg.wtf".is_valid_url?
  p "huzzah!"
end
For me, I use this regular expression:
/\A(http|https):\/\/[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(:[0-9]{1,5})?(\/.*)?\z/ix
Option:
i - case insensitive
x - ignore whitespace in regex
You can set this method to check URL validation:
def valid_url?(url)
  return false if url.include?("<script")
  url_regexp = /\A(http|https):\/\/[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(:[0-9]{1,5})?(\/.*)?\z/ix
  url =~ url_regexp ? true : false
end
To use it:
valid_url?("http://stackoverflow.com/questions/1805761/check-if-url-is-valid-ruby")
Testing with invalid URLs:
http://ruby3arabi - result is invalid
http://http://ruby3arabi.com - result is invalid
http:// - result is invalid
http://test.com\n<script src=\"nasty.js\"> - result is invalid (caught by the simple "<script" check)
127.0.0.1 - result is invalid (IP addresses are not supported)
Testing with valid URLs:
http://ruby3arabi.com - result is valid
http://www.ruby3arabi.com - result is valid
https://www.ruby3arabi.com - result is valid
https://www.ruby3arabi.com/article/1 - result is valid
https://www.ruby3arabi.com/websites/58e212ff6d275e4bf9000000?locale=en - result is valid
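In general,
/\A#{URI::regexp}\z/
will work well, but if you only want to match http or https, you can pass those in as options to the method:
/\A#{URI::regexp(%w(http https))}\z/
That tends to work a little better if you want to reject protocols like ftp://. Note the \A and \z anchors: in Ruby, ^ and $ match at every line break, so a multi-line string could pass the check with only one of its lines being a URL.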
This is a little bit old but here is how I do it. Use Ruby's URI module to parse the URL. If it can be parsed then it's a valid URL. (But that doesn't mean accessible.)
URI supports many schemes, plus you can add custom schemes yourself:
irb> uri = URI.parse "http://hello.it" rescue nil
=> #<URI::HTTP:0x10755c50 URL:http://hello.it>
irb> uri.instance_values
=> {"fragment"=>nil,
"registry"=>nil,
"scheme"=>"http",
"query"=>nil,
"port"=>80,
"path"=>"",
"host"=>"hello.it",
"password"=>nil,
"user"=>nil,
"opaque"=>nil}
irb> uri = URI.parse "http:||bra.ziz" rescue nil
=> nil
irb> uri = URI.parse "ssh://hello.it:5888" rescue nil
=> #<URI::Generic:0x105fe938 URL:ssh://hello.it:5888>
irb> uri.instance_values
=> {"fragment"=>nil,
"registry"=>nil,
"scheme"=>"ssh",
"query"=>nil,
"port"=>5888,
"path"=>"",
"host"=>"hello.it",
"password"=>nil,
"user"=>nil,
"opaque"=>nil}
See the documentation for more information about the URI module.
You could also use a regex, maybe something like http://www.geekzilla.co.uk/View2D3B0109-C1B2-4B4E-BFFD-E8088CBC85FD.htm. Assuming this regex is correct (I haven't fully checked it), the following will show the validity of the URL.
# A regexp literal is used so that \w and \d are not swallowed by string escaping
url_regex = %r{((https?|ftp|file):((//)|(\\))+[\w\d:\#@%/;$()~_?\+-=\\.&]*)}
urls = [
  "http://hello.it",
  "http:||bra.ziz"
]
urls.each { |url|
  if url =~ url_regex then
    puts "%s is valid" % url
  else
    puts "%s not valid" % url
  end
}
The above example outputs:
http://hello.it is valid
http:||bra.ziz not valid
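None of the approaches above covers the second half of the question, whether the URL refers to an image file. One rough sketch, assuming that "image" can be judged from the file extension in the URL path alone: image_url? and IMAGE_EXTENSIONS are hypothetical names, the extension list is only an example, and a URL without an extension can still serve an image, so checking the Content-Type of the response would be the more reliable test.

```ruby
require 'uri'

# Example list only; extend it to whatever formats matter to you.
IMAGE_EXTENSIONS = %w[.png .jpg .jpeg .gif .webp .svg].freeze

# Hypothetical helper: true when the string is an http/https URL whose path
# ends in one of the extensions above. The query string is not part of the
# path, so "photo.jpg?size=large" still counts.
def image_url?(string)
  uri = URI.parse(string)
  return false unless uri.is_a?(URI::HTTP)

  IMAGE_EXTENSIONS.include?(File.extname(uri.path.to_s).downcase)
rescue URI::InvalidURIError
  false
end

image_url?('http://hello.it/logo.PNG')  # => true
image_url?('http://hello.it')           # => false
```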

Resources