Check if URL exists in Ruby - ruby

How would I go about checking if a URL exists using Ruby?
For example, for the URL
https://google.com
the result should be truthy, but for the URLs
https://no.such.domain
or
https://stackoverflow.com/no/such/path
the result should be falsey

Use the Net::HTTP library.
require "net/http"
url = URI.parse("http://www.google.com/")
req = Net::HTTP.new(url.host, url.port)
res = req.request_head(url.path)
At this point res is a Net::HTTPResponse object containing the result of the request. You can then check the response code:
do_something_with_it(url) if res.code == "200"
Note: to check an https URL, set the use_ssl attribute to true:
require "net/http"
url = URI.parse("https://www.google.com/")
req = Net::HTTP.new(url.host, url.port)
req.use_ssl = true
res = req.request_head(url.path)
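The two snippets above differ only in whether use_ssl is set, so that choice can be derived from the URL's scheme. A minimal sketch, assuming a hypothetical helper name (http_client_for is not part of Net::HTTP):

```ruby
require "net/http"
require "uri"

# Hypothetical helper: build an unstarted Net::HTTP client configured from the URL.
# No connection is opened until a request is sent.
def http_client_for(url_string)
  uri = URI.parse(url_string)
  http = Net::HTTP.new(uri.host, uri.port) # uri.port defaults to 80 or 443 by scheme
  http.use_ssl = (uri.scheme == "https")
  http
end

# Usage (performs network I/O, so left commented out):
# res = http_client_for("https://www.google.com/").request_head("/")
# puts res.code
```

This keeps a single code path for http and https URLs instead of two near-identical snippets.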

Sorry for the late reply on this, but I think this deserves a better answer.
There are three ways to look at this question:
Strict check if the URL exists
Check if you are requesting the URL correctly
Check if you can request it correctly and the server can answer it correctly
1. Strict check if the URL exists
While 200 means that the server answers for that URL (thus, the URL exists), answering with another status code doesn't mean that the URL does not exist. For example, 302 (redirected) means that the URL exists and redirects to another one. While browsing, a 302 often behaves the same as a 200 for the end user. Another status code that can be returned if a URL exists is 500 (internal server error). After all, if the URL did not exist, why would the application server have processed your request instead of simply returning 404 (not found)?
So there are actually only two cases when a URL does not exist: when the server does not exist, or when the server exists but can't find the given URL path. Thus, the only way to check if the URL exists is to check that the server answers and that the return code is not 404. The following code does just that.
require "net/http"
def url_exist?(url_string)
url = URI.parse(url_string)
req = Net::HTTP.new(url.host, url.port)
req.use_ssl = (url.scheme == 'https')
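path = url.path.empty? ? '/' : url.path # plain Ruby; present? requires ActiveSupport
res = req.request_head(path)
res.code != "404" # false if returns 404 - not found
rescue SocketError, SystemCallError # DNS failure or connection error (Errno::*)
false # false if can't find the server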
end
2. Check if you are requesting the URL correctly
However, most of the time we are not interested in whether a URL exists, but in whether we can access it. Fortunately, the HTTP status code families help here: the 4xx family stands for client errors (an error on your side, meaning you are not requesting the page correctly, don't have permission, and so on). This is a good family of errors to check to see if you can access the page. From the wiki:
The 4xx class of status code is intended for cases in which the client seems to have erred. Except when responding to a HEAD request, the server should include an entity containing an explanation of the error situation, and whether it is a temporary or permanent condition. These status codes are applicable to any request method. User agents should display any included entity to the user.
So the following code makes sure the URL exists and you can access it:
require "net/http"
def url_exist?(url_string)
url = URI.parse(url_string)
req = Net::HTTP.new(url.host, url.port)
req.use_ssl = (url.scheme == 'https')
path = url.path.empty? ? '/' : url.path # plain Ruby; present? requires ActiveSupport
res = req.request_head(path)
if res.kind_of?(Net::HTTPRedirection)
url_exist?(res['location']) # Go after any redirect and make sure you can access the redirected URL
else
res.code[0] != "4" #false if http code starts with 4 - error on your side.
end
rescue SocketError, SystemCallError # DNS failure or connection error (Errno::*)
false #false if can't find the server
end
3. Check if you can request it correctly and the server can answer it correctly
Just as the 4xx family tells you whether you can access the URL, the 5xx family tells you whether the server had a problem answering your request. Errors in this family are most often due to problems on the server itself, and hopefully someone is working on solving them. If you need to be able to access the page and get a correct answer now, you should make sure the answer is not from the 4xx or 5xx family and, if you were redirected, that the redirected page answers correctly. So, much like (2), you can simply use the following code:
require "net/http"
def url_exist?(url_string)
url = URI.parse(url_string)
req = Net::HTTP.new(url.host, url.port)
req.use_ssl = (url.scheme == 'https')
path = url.path.empty? ? '/' : url.path # plain Ruby; present? requires ActiveSupport
res = req.request_head(path)
if res.kind_of?(Net::HTTPRedirection)
url_exist?(res['location']) # Go after any redirect and make sure you can access the redirected URL
else
! %W(4 5).include?(res.code[0]) # Not from 4xx or 5xx families
end
rescue SocketError, SystemCallError # DNS failure or connection error (Errno::*)
false #false if can't find the server
end
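The three variants differ only in which final status codes they reject, so that decision can be isolated from the networking. A rough sketch, assuming a hypothetical helper and hypothetical mode names (:strict, :accessible, :working) chosen here to mirror cases 1 to 3:

```ruby
# Hypothetical helper: decide whether a final (non-redirect) status code
# counts as "exists" under each of the three interpretations.
def acceptable_status?(code, mode)
  code = code.to_s
  family = code[0] # first digit: "2", "3", "4" or "5"
  case mode
  when :strict     then code != "404"                 # 1. anything but 404
  when :accessible then family != "4"                 # 2. no client error
  when :working    then !%w[4 5].include?(family)     # 3. no client or server error
  else raise ArgumentError, "unknown mode: #{mode.inspect}"
  end
end

acceptable_status?("500", :strict)  # true: the server processed the request
acceptable_status?("500", :working) # false: the server failed to answer
```

Separating the rule from the request makes it easy to check each interpretation without touching the network.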

Net::HTTP works, but if you can work outside the stdlib, Faraday is better.
Faraday.head(the_url).status == 200
(200 is a success code, assuming that's what you meant by "exists".)

Simone's answer was very helpful to me.
Here is a version that returns true/false depending on URL validity, and which handles redirects:
require 'net/http'
require 'set'
def working_url?(url, max_redirects=6)
response = nil
seen = Set.new
loop do
url = URI.parse(url)
break if seen.include? url.to_s
break if seen.size > max_redirects
seen.add(url.to_s)
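response = Net::HTTP.start(url.host, url.port, use_ssl: url.scheme == 'https') { |http| http.request_head(url.request_uri) } # enable SSL for https; request_uri is "/" for an empty path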
if response.kind_of?(Net::HTTPRedirection)
url = response['location']
else
break
end
end
response.kind_of?(Net::HTTPSuccess) && url.to_s
end
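One case the loop above does not cover: a Location header may be a relative reference such as /login, which URI.parse alone leaves without a host. A small sketch of resolving it against the URL that produced the redirect, assuming a hypothetical helper name and example.com / other.org as placeholder hosts:

```ruby
require "uri"

# Hypothetical helper: turn a Location header into an absolute URL string,
# resolving it against the URL that produced the redirect.
def resolve_location(current_url, location)
  URI.join(current_url, location).to_s
end

resolve_location("http://example.com/a/b", "/c")                  # "http://example.com/c"
resolve_location("http://example.com/a/b", "https://other.org/x") # "https://other.org/x"
```

An absolute Location passes through unchanged, so the helper is safe to apply to every redirect.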

Related

Fastest way to check if a url exists

Currently I am writing a program that needs to check tons of possible URLs, searching for any that actually exist. To be precise, I mean exist as in you can visit the URL and there's actual content of some sort, not string parsing to see if it's in URL format.
The program generates a list of possible variants for a filename and then checks each one until it gets a url that actually exists, so most of the url remains the same. Examples would be,
https://www.test.com/folder1/FILE.png
https://www.test.com/folder1/File.png
https://www.test.com/folder1/file.png
https://www.test.com/folder1/file1.png
That said, my code currently works fine. However, it ends up taking about 2-4 seconds per URL check and I don't know of a way to speed it up. Is there any faster or better way to validate URLs, or am I just out of luck?
This is my function to validate urls:
require "net/http"
def url_exist? url_path
url = URI.parse(url_path)
req = Net::HTTP.new(url.host, url.port)
req.use_ssl = true
res = req.request_head(url.path)
if res.code == "200" || res.code == "403"
return true
end
end
Thank you for taking the time to read this and any help will be much appreciated.
Your code creates a new connection for each URL. It should be faster to send multiple requests over the same connection via HTTP keep-alive.
In Ruby, you can open such a connection via Net::HTTP.start, e.g.:
require 'net/http'
class URLChecker
def initialize(base_url)
uri = URI(base_url)
Net::HTTP.start(uri.host, uri.port, use_ssl: uri.is_a?(URI::HTTPS)) do |http|
@http = http
yield self
end
end
def exist?(path)
res = @http.head(path)
res.code == '200' || res.code == '403'
end
end
URLChecker.new('https://stackoverflow.com') do |uc|
p uc.exist?('/questions/tagged/ruby') #=> true
p uc.exist?('/questions/tagged/python') #=> true
p uc.exist?('/questions/tagged/foobar') #=> false
end
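For the use case in the question (trying filename variants until one exists), the same persistent connection can be combined with an early exit. A minimal sketch, assuming a hypothetical helper that accepts any object responding to head(path), such as a started Net::HTTP:

```ruby
require "net/http"

# Hypothetical helper: return the first path whose HEAD response is 200 or 403,
# or nil if none match. `http` is any object responding to #head(path).
def first_existing_path(http, paths)
  paths.find do |path|
    code = http.head(path).code
    code == "200" || code == "403"
  end
end

# Usage against a real host (performs network I/O, so left commented out):
# uri = URI("https://www.test.com")
# Net::HTTP.start(uri.host, uri.port, use_ssl: true) do |http|
#   p first_existing_path(http, ["/folder1/FILE.png", "/folder1/file.png"])
# end
```

Because find stops at the first match, no requests are sent for the remaining variants.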

How do I get a redirected status code using NET:HTTP?

Similar to "getting the status code of a HTTP redirected page", but with Net::HTTP instead of curb. I am making a GET request to a page that will redirect:
response = Net::HTTP.get_response(URI.parse("http://www.wikipedia.org/wiki/URL_redirection"))
puts response.code
puts response['location']
=> 301
en.wikipedia.org/wiki/URL_redirection
The problem is that I want to know the status code of the redirected page. In this case it is 200, but in my app I want to check if it is 200 or something else.
The solution I've seen is to just call get_response(response['location']), but that won't work in my application because the way the redirect is designed makes it so that the redirect can only be followed once. Since the first GET consumes that one redirect, I can't then follow it again.
Is there some way to get the last status code that is a result of a GET?
EDIT: Further clarification of the situation:
The application that I'm sending GET to has a single sign-on authentication mechanism where, if I want to access 'myapp/mypage', I have to first send a post:
postResponse = Net::HTTP.post_form(URI.parse("http://myapp.com/trusted"), {"username" => @username})
Then make the GET request to:
"http://myapp.com/trusted/#{postResponse.body}/mypage"
*The postResponse.body is a 'ticket' which can be redeemed once.
That GET verifies that the ticket is valid and then redirects to:
myapp.com/mypage
So whether that ticket is valid or not, I get a 301.
I want to check the status code of the final get to myapp.com/mypage.
If I manually try to follow the redirect, whether it's a HEAD request or a GET, the original redirect will have already consumed the ticket, so I will get an error that the ticket is expired even if the original redirect was a 200.
The Net::HTTP documentation has example code showing how to deal with redirects. Have you tried it? It should make it easy to get inside the redirect mechanism and grab statuses for later.
Here's their example:
Following Redirection
Each Net::HTTPResponse object belongs to a class for its response code.
For example, all 2XX responses are instances of a Net::HTTPSuccess subclass, a 3XX response is an instance of a Net::HTTPRedirection subclass and a 200 response is an instance of the Net::HTTPOK class. For details of response classes, see the section “HTTP Response Classes” below.
Using a case statement you can handle various types of responses properly:
def fetch(uri_str, limit = 10)
# You should choose a better exception.
raise ArgumentError, 'too many HTTP redirects' if limit == 0
response = Net::HTTP.get_response(URI(uri_str))
case response
when Net::HTTPSuccess then
response
when Net::HTTPRedirection then
location = response['location']
warn "redirected to #{location}"
fetch(location, limit - 1)
else
response.value
end
end
print fetch('http://www.ruby-lang.org')
A minor change like this should help:
require 'net/http'
RESPONSES = []
def fetch(uri_str, limit = 10)
# You should choose a better exception.
raise ArgumentError, 'too many HTTP redirects' if limit == 0
response = Net::HTTP.get_response(URI(uri_str))
RESPONSES << response
case response
when Net::HTTPSuccess then
response
when Net::HTTPRedirection then
location = response['location']
warn "redirected to #{location}"
fetch(location, limit - 1)
else
response.value
end
end
print fetch('http://jigsaw.w3.org/HTTP/300/302.html')
puts RESPONSES.join("\n") # =>
I see this when I run it:
redirected to http://jigsaw.w3.org/HTTP/300/Overview.html
#<Net::HTTPOK:0x007f9e82a1e050>#<Net::HTTPFound:0x007f9e82a2daa0>
#<Net::HTTPOK:0x007f9e82a1e050>
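If a global constant is undesirable, the same idea can return the status codes alongside the final response. A sketch under stated assumptions: fetch_with_history and its getter: parameter are hypothetical names introduced here (the latter only so the function can be exercised without a network), not part of Net::HTTP.

```ruby
require "net/http"
require "uri"

# Hypothetical variant of fetch: returns [final_response, codes_seen]
# instead of appending to a global constant.
def fetch_with_history(uri_str, limit = 10, codes = [], getter: Net::HTTP.method(:get_response))
  raise ArgumentError, "too many HTTP redirects" if limit == 0

  response = getter.call(URI(uri_str))
  codes << response.code
  location = response["location"]

  if response.is_a?(Net::HTTPRedirection) && location
    # Resolve relative Location headers against the current URL
    next_url = URI.join(uri_str, location).to_s
    fetch_with_history(next_url, limit - 1, codes, getter: getter)
  else
    [response, codes]
  end
end

# Usage (performs network I/O, so left commented out):
# response, codes = fetch_with_history("http://jigsaw.w3.org/HTTP/300/302.html")
# puts codes.join(" -> ")
```

Note that this still issues a fresh GET for each hop, so it does not by itself avoid consuming a one-time redirect.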
If it's enough just to make an HTTP HEAD request without 'consuming' your URL (this would be the usual expectation for a HEAD request), you can do it like this:
2.0.0-p195 :143 > result = Net::HTTP.start('www.google.com') { |http| http.head '/' }
=> #<Net::HTTPFound 302 Found readbody=true>
So in your example you'd do this:
...
result = Net::HTTP.start(response.uri.host) { |http| http.head response.uri.path }
If you want to preserve a history of response codes, you could try this. This retains the last 5 response codes from calls to get_response and exposes them through a Net::HTTP.history method.
module Net
class << HTTP
alias_method :_get_response, :get_response
def get_response *args, &block
resp = _get_response *args, &block
@history = (@history || []).push(resp.code).last 5
resp
end
def history
@history || []
end
end
end
(I don't entirely get the usage scenario, so adapt to your needs)

Net::HTTP returning 404 when I know it's 301

I've got a piece of Ruby code that I've written to follow a series of potential redirects until it reaches the final URL:
def self.obtain_final_url_in_chain url
logger.debug "Following '#{url}'"
uri = URI url
http = Net::HTTP.start uri.host, uri.port
response = http.request_head url
case response.code
when "301"
obtain_final_url_in_chain response['location']
when "302"
obtain_final_url_in_chain response['location']
else
url
end
end
You call obtain_final_url_in_chain with the url and it should eventually return the final url.
I'm trying it with this URL: http://feeds.5by5.tv/master
Based on http://web-sniffer.net/ this should be redirected to http://5by5.tv/rss as a result of a 301 redirect. Instead though I get a 404 for http://feeds.5by5.tv/master.
The above code is returning 200 for other URLs though (eg. http://feeds.feedburner.com/5by5video).
Does anyone know why this is happening please? It's driving me nuts!
Thanks.
According to the docs for Net::HTTP#request_head, you want to pass the path, not the full url, as the first parameter.
With that and a few other changes, here's one way to rewrite your method:
def obtain_final_url_in_chain(url)
uri = URI url
response = Net::HTTP.start(uri.host, uri.port) do |http|
http.request_head uri.path
end
case response
when Net::HTTPRedirection
obtain_final_url_in_chain response['location']
else
url
end
end
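To make the path-versus-URL distinction concrete, here is a small sketch of the pieces a parsed URI exposes. The helper name head_target is hypothetical, and the second URL is a made-up example; note that request_uri, unlike path, keeps the query string.

```ruby
require "uri"

# Hypothetical helper: split a URL into the parts Net::HTTP needs:
# host and port for the connection, request_uri for request_head.
def head_target(url_string)
  uri = URI(url_string)
  [uri.host, uri.port, uri.request_uri]
end

head_target("http://feeds.5by5.tv/master")      # ["feeds.5by5.tv", 80, "/master"]
head_target("http://example.com/search?q=ruby") # ["example.com", 80, "/search?q=ruby"]
URI("http://example.com/search?q=ruby").path    # "/search" (query dropped)
```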

Ruby Net::HTTP - following 301 redirects

My users submit urls (to mixes on mixcloud.com) and my app uses them to perform web requests.
A good url returns a 200 status code:
uri = URI.parse("http://www.mixcloud.com/ErolAlkan/hard-summer-mix/")
request = Net::HTTP.get_response(uri)
#<Net::HTTPOK 200 OK readbody=true>
But if you forget the trailing slash then our otherwise good url returns a 301:
uri = "http://www.mixcloud.com/ErolAlkan/hard-summer-mix"
#<Net::HTTPMovedPermanently 301 MOVED PERMANENTLY readbody=true>
The same thing happens with 404's:
# bad path returns a 404
"http://www.mixcloud.com/bad/path/"
# bad path minus trailing slash returns a 301
"http://www.mixcloud.com/bad/path"
How can I 'drill down' into the 301 to see if it takes us on to a valid resource or an error page?
Is there a tool that provides a comprehensive overview of the rules that a particular domain might apply to their urls?
301 redirects are fairly common if you do not type the URL exactly as the web server expects it. They happen much more frequently than you'd think, you just don't normally ever notice them while browsing because the browser does all that automatically for you.
Two alternatives come to mind:
1: Use open-uri
open-uri handles redirects automatically. So all you'd need to do is:
require 'open-uri'
...
response = URI.open('http://xyz...').read # Kernel#open no longer opens URLs as of Ruby 3.0
If you have trouble redirecting between HTTP and HTTPS, then have a look here for a solution:
Ruby open-uri redirect forbidden
2: Handle redirects with Net::HTTP
def get_response_with_redirect(uri)
r = Net::HTTP.get_response(uri)
if r.code == "301"
r = Net::HTTP.get_response(URI.parse(r['location']))
end
r
end
If you want to be even smarter, you could try adding or removing the trailing slash on the URL when you get a 404 response. You could do that by creating a method like get_response_smart which handles this URL fiddling in addition to the redirects.
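The URL fiddling itself can be sketched separately from the requests. This is only an illustration, assuming a hypothetical helper name; a get_response_smart method could retry with the toggled URL after a 404.

```ruby
require "uri"

# Hypothetical helper: add a trailing slash to the path if it is missing,
# or remove it if present. The query string, if any, is preserved.
def toggle_trailing_slash(url_string)
  uri = URI.parse(url_string)
  path = uri.path.empty? ? "/" : uri.path
  uri.path = path.end_with?("/") ? path.chomp("/") : "#{path}/"
  uri.to_s
end

toggle_trailing_slash("http://www.mixcloud.com/bad/path")  # "http://www.mixcloud.com/bad/path/"
toggle_trailing_slash("http://www.mixcloud.com/bad/path/") # "http://www.mixcloud.com/bad/path"
```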
I can't figure out how to comment on the accepted answer (this question might be closed), but I should note that r.header is now obsolete, so r.header['location'] should be replaced by r['location'] (per https://stackoverflow.com/a/6934503/1084675 )
rest-client follows redirections for GET and HEAD requests without any additional configuration. It works very nicely.
for result codes between 200 and 207, a RestClient::Response will be returned
for result codes 301, 302 or 307, the redirection will be followed if the request is a GET or a HEAD
for result code 303, the redirection will be followed and the request transformed into a GET
example of usage:
require 'rest-client'
RestClient.get 'http://example.com/resource'
The rest-client README also gives an example of following redirects with POST requests:
begin
RestClient.post('http://example.com/redirect', 'body')
rescue RestClient::MovedPermanently,
RestClient::Found,
RestClient::TemporaryRedirect => err
err.response.follow_redirection
end
Here is the code I came up with (derived from different examples) which will bail out if there are too many redirects (note that ensure_success is optional):
require "net/http"
require "uri"
class Net::HTTPResponse
def ensure_success
unless kind_of? Net::HTTPSuccess
warn "Request failed with HTTP #{@code}"
each_header do |h,v|
warn "#{h} => #{v}"
end
abort
end
end
end
def do_request(uri_string)
response = nil
tries = 0
loop do
uri = URI.parse(uri_string)
http = Net::HTTP.new(uri.host, uri.port)
request = Net::HTTP::Get.new(uri.request_uri)
response = http.request(request)
uri_string = response['location'] if response['location']
unless response.kind_of? Net::HTTPRedirection
response.ensure_success
break
end
if tries == 10
puts "Timing out after 10 tries"
break
end
tries += 1
end
response
end
Not sure if anyone is looking for this exact solution, but if you are trying to download an image over http/https and store it in a variable, this works:
require 'open_uri_redirections'
require 'net/https'
web_contents = open('file_url_goes_here', :ssl_verify_mode => OpenSSL::SSL::VERIFY_NONE, :allow_redirections => :all) {|f| f.read }
puts web_contents

Check if a given username is taken on Facebook

How to check if a username is already in use on Facebook?
My solution was trying to access http://www.facebook.com/USER and check the http headers (200 = OK; 404 = NOT FOUND). I could use this code:
require 'open-uri'
require 'net/http'
def remote_file_exists?(url,httpcode)
url = URI.parse(url)
Net::HTTP.start(url.host, url.port) do |http|
return http.head(url.request_uri).code == httpcode
end
end
The problem is that Facebook always returns 302 (Found), then redirects to https://www.facebook.com/USER.
I can require net/https and create a new function:
def https_url_exists? (url,httpcode)
url = URI.parse(url)
net = Net::HTTP.new(url.host, url.port)
net.use_ssl = true
net.verify_mode = OpenSSL::SSL::VERIFY_NONE
net.start do |http|
return (http.head(url.request_uri).code == httpcode)
end
end
Now the problem is that some users use dots in their usernames. For example, the username might be user.name. Facebook uses redirections for this.
What's the best way to check if USERNAME exists on Facebook? How do I get USER.NAME if USERNAME redirects to it?
You can use https://graph.facebook.com/username. This will return a JSON response with info to see if it exists, as well as enough information to identify it as a user or page.
Once you have a valid user, you can get the user's first and last name using:
https://graph.facebook.com/{userId}?fields=first_name,last_name
