Ruby Net::HTTP - following 301 redirects - ruby

My users submit urls (to mixes on mixcloud.com) and my app uses them to perform web requests.
A good url returns a 200 status code:
uri = URI.parse("http://www.mixcloud.com/ErolAlkan/hard-summer-mix/")
request = Net::HTTP.get_response(uri)(
#<Net::HTTPOK 200 OK readbody=true>
But if you forget the trailing slash then our otherwise good url returns a 301:
uri = "http://www.mixcloud.com/ErolAlkan/hard-summer-mix"
#<Net::HTTPMovedPermanently 301 MOVED PERMANENTLY readbody=true>
The same thing happens with 404's:
# bad path returns a 404
"http://www.mixcloud.com/bad/path/"
# bad path minus trailing slash returns a 301
"http://www.mixcloud.com/bad/path"
How can I 'drill down' into the 301 to see if it takes us on to a valid resource or an error page?
Is there a tool that provides a comprehensive overview of the rules that a particular domain might apply to their urls?

301 redirects are fairly common if you do not type the URL exactly as the web server expects it. They happen much more frequently than you'd think, you just don't normally ever notice them while browsing because the browser does all that automatically for you.
Two alternatives come to mind:
1: Use open-uri
open-uri handles redirects automatically. So all you'd need to do is:
require 'open-uri'
...
response = open('http://xyz...').read
If you have trouble redirecting between HTTP and HTTPS, then have a look here for a solution:
Ruby open-uri redirect forbidden
2: Handle redirects with Net::HTTP
def get_response_with_redirect(uri)
r = Net::HTTP.get_response(uri)
if r.code == "301"
r = Net::HTTP.get_response(URI.parse(r['location']))
end
r
end
If you want to be even smarter you could try to add or remove missing backslashes to the URL when you get a 404 response. You could do that by creating a method like get_response_smart which handles this URL fiddling in addition to the redirects.

I can't figure out how to comment on the accepted answer (this question might be closed), but I should note that r.header is now obsolete, so r.header['location'] should be replaced by r['location'] (per https://stackoverflow.com/a/6934503/1084675 )

rest-client follows the redirections for GET and HEAD requests without any additional configuration. It works very nice.
for result codes between 200 and 207, a RestClient::Response will be returned
for result codes 301, 302 or 307, the redirection will be followed if the request is a GET or a HEAD
for result code 303, the redirection will be followed and the request transformed into a GET
example of usage:
require 'rest-client'
RestClient.get 'http://example.com/resource'
The rest-client README also gives an example of following redirects with POST requests:
begin
RestClient.post('http://example.com/redirect', 'body')
rescue RestClient::MovedPermanently,
RestClient::Found,
RestClient::TemporaryRedirect => err
err.response.follow_redirection
end

Here is the code I came up with (derived from different examples) which will bail out if there are too many redirects (note that ensure_success is optional):
require "net/http"
require "uri"
class Net::HTTPResponse
def ensure_success
unless kind_of? Net::HTTPSuccess
warn "Request failed with HTTP #{#code}"
each_header do |h,v|
warn "#{h} => #{v}"
end
abort
end
end
end
def do_request(uri_string)
response = nil
tries = 0
loop do
uri = URI.parse(uri_string)
http = Net::HTTP.new(uri.host, uri.port)
request = Net::HTTP::Get.new(uri.request_uri)
response = http.request(request)
uri_string = response['location'] if response['location']
unless response.kind_of? Net::HTTPRedirection
response.ensure_success
break
end
if tries == 10
puts "Timing out after 10 tries"
break
end
tries += 1
end
response
end

Not sure if anyone is looking for this exact solution, but if you are trying to download an image http/https and store it to a variable
require 'open_uri_redirections'
require 'net/https'
web_contents = open('file_url_goes_here', :ssl_verify_mode => OpenSSL::SSL::VERIFY_NONE, :allow_redirections => :all) {|f| f.read }
puts web_contents

Related

How do I get a redirected status code using NET:HTTP?

Similar to "getting the status code of a HTTP redirected page", but with NET::HTTP instead of curb I am making a GET request to a page that that will redirect:
response = Net::HTTP.get_response(URI.parse("http://www.wikipedia.org/wiki/URL_redirection"))
puts response.code #{
puts response['location']
=> 301
en.wikipedia.org/wiki/URL_redirection
The problem is that I want to know the status code of the redirected page. In this case it is 200, but in my app I want to check if it is 200 or something else.
The solution I've seen is to just call get_response(response['location']), but that won't work in my application because the way the redirect is designed makes it so that the redirect can only be followed once. Since the first GET consumes that one redirect, I can't then follow it again.
Is there some way to get the last status code that is a result of a GET?
EDIT: Further clarification of the situation:
The application that I'm sending GET to has a single sign-on authentication mechanism where, if I want to access 'myapp/mypage', I have to first send a post:
postResponse = Net::HTTP.post_form(URI.parse("http://myapp.com/trusted"), {"username" => #username})
Then make the GET request to:
'http://myapp.com/trusted/#{postResponse.body}/mypage
*The postResponse.body is a 'ticket' which can be redeemed once.
That GET verifies that the ticket is valid and then redirects to:
myapp.com/mypage
So whether that ticket is valid or not, I get a 301.
I want to check the status code of the final get to myapp.com/mypage.
If I manually try to follow the redirect, whether it's a HEAD request or a GET, the original redirect will have already consumed the ticket, so I will get an error that the ticket is expired even if the original redirect was a 200.
The Net::HTTP documentation has example code showing how to deal with redirects. Have you tried it? It should make it easy to get inside the redirect mechanism and grab statuses for later.
Here's their example:
Following Redirection
Each Net::HTTPResponse object belongs to a class for its response code.
For example, all 2XX responses are instances of a Net::HTTPSuccess subclass, a 3XX response is an instance of a Net::HTTPRedirection subclass and a 200 response is an instance of the Net::HTTPOK class. For details of response classes, see the section “HTTP Response Classes” below.
Using a case statement you can handle various types of responses properly:
def fetch(uri_str, limit = 10)
# You should choose a better exception.
raise ArgumentError, 'too many HTTP redirects' if limit == 0
response = Net::HTTP.get_response(URI(uri_str))
case response
when Net::HTTPSuccess then
response
when Net::HTTPRedirection then
location = response['location']
warn "redirected to #{location}"
fetch(location, limit - 1)
else
response.value
end
end
print fetch('http://www.ruby-lang.org')
A minor change like this should help:
require 'net/http'
RESPONSES = []
def fetch(uri_str, limit = 10)
# You should choose a better exception.
raise ArgumentError, 'too many HTTP redirects' if limit == 0
response = Net::HTTP.get_response(URI(uri_str))
RESPONSES << response
case response
when Net::HTTPSuccess then
response
when Net::HTTPRedirection then
location = response['location']
warn "redirected to #{location}"
fetch(location, limit - 1)
else
response.value
end
end
print fetch('http://jigsaw.w3.org/HTTP/300/302.html')
puts RESPONSES.join("\n") # =>
I see this when I run it:
redirected to http://jigsaw.w3.org/HTTP/300/Overview.html
#<Net::HTTPOK:0x007f9e82a1e050>#<Net::HTTPFound:0x007f9e82a2daa0>
#<Net::HTTPOK:0x007f9e82a1e050>
If it's enough just to make an HTTP HEAD request without 'consuming' your URL (this would be the usual expectation for a HEAD request), you can do it like this:
2.0.0-p195 :143 > result = Net::HTTP.start('www.google.com') { |http| http.head '/' }
=> #<Net::HTTPFound 302 Found readbody=true>
So in your example you'd do this:
...
result = Net::HTTP.start(response.uri.host) { |http| http.head response.uri.path }
If you want to preserve a history of response codes, you could try this. This retains the last 5 response codes from calls to get_response and exposes them through a Net::HTTP.history method.
module Net
class << HTTP
alias_method :_get_response, :get_response
def get_response *args, &block
resp = _get_response *args, &block
#history = (#history || []).push(resp.code).last 5
resp
end
def history
#history || []
end
end
end
(I don't entirely get the usage scenario, so adapt to your needs)

Parse html GET via open() with nokogiri - redirect exception

I'm trying to learn ruby, so I'm following an exercise of google dev. I'm trying to parse some links. In the case of successful redirection (considering that I know that it its possible only to get redirected once), I get redirect forbidden. I noticed that I go from a http protocol link to an https protocol link. Any concrete idea how could I implement in this in ruby because google's exercise is for python?
error:
ruby fix.rb
redirection forbidden: http://code.google.com/edu/languages/google-python-class/images/puzzle/p-bija-baei.jpg -> https://developers.google.com/edu/python/images/puzzle/p-bija-baei.jpg?csw=1
code that should achieve what I'm looking for:
def acquireData(urls, imgs) #List item urls list of valid urls !checked, imgs list of the imgs I'll download afterwards.
begin
urls.each do |url|
page = Nokogiri::HTML(open(url))
puts page.body
end
rescue Exception => e
puts e
end
end
Ruby's OpenURI will automatically handle redirects for you, as long as they're not "meta-refresh" that occur inside the HTML itself.
For instance, this follows a redirect automatically:
irb(main):008:0> page = open('http://www.example.org')
#<StringIO:0x00000002ae2de0>
irb(main):009:0> page.base_uri.to_s
"http://www.iana.org/domains/example"
In other words, the request to "www.example.org" got redirected to "www.iana.org" and OpenURI tracked it correctly.
If you are trying to learn HOW to handle redirects, read the Net::HTTP documentation. Here is the example how to do it from the document:
Following Redirection
Each Net::HTTPResponse object belongs to a class for its response code.
For example, all 2XX responses are instances of a Net::HTTPSuccess subclass, a 3XX response is an instance of a Net::HTTPRedirection subclass and a 200 response is an instance of the Net::HTTPOK class. For details of response classes, see the section “HTTP Response Classes” below.
Using a case statement you can handle various types of responses properly:
def fetch(uri_str, limit = 10)
# You should choose a better exception.
raise ArgumentError, 'too many HTTP redirects' if limit == 0
response = Net::HTTP.get_response(URI(uri_str))
case response
when Net::HTTPSuccess then
response
when Net::HTTPRedirection then
location = response['location']
warn "redirected to #{location}"
fetch(location, limit - 1)
else
response.value
end
end
print fetch('http://www.ruby-lang.org')
If you want to handle meta-refresh statements, reflect on this:
require 'nokogiri'
doc = Nokogiri::HTML(%[<meta http-equiv="refresh" content="5;URL='http://example.com/'">])
meta_refresh = doc.at('meta[http-equiv="refresh"]')
if meta_refresh
puts meta_refresh['content'][/URL=(.+)/, 1].gsub(/['"]/, '')
end
Which outputs:
http://example.com/
Basically the url in code.google that you're trying to open redirects to a https url. You can see that by yourself if you paste http://code.google.com/edu/languages/google-python-class/images/puzzle/p-bija-baei.jpg into your browser
Check the following bug report that explains why open-uri can't redirect to https;
So the solution to your problem is simply: use a different set of urls (that don't redirect to https)

Getting the contents of a 404 error page response ruby

I know some languages have a library that allows you to get the HTTP content for a 404 or 500 message.
Is there a library that allows that for Ruby?
I've tried open-uri but it simply returns an HTTPError exception without the HTML content for the 404 response.
This doesn't seem to be stated clearly enough in the docs, but HttpError has an io attribute, which you can treat as a read only file as far as i know.
require 'open-uri'
begin
response = open('http://google.com/blahblah')
rescue => e
puts e # Error message
puts e.io.status # Http Error code
puts e.io.readlines # Http response body
end
Net::HTTP supports what you need.
You can use the request_get method and it will return a response regardless of the status code.
From script/console:
> http = Net::HTTP.new('localhost', 3000)
=> #<Net::HTTP localhost:3000 open=false>
> resp = http.request_get('/foo') # a page that doesn't exist
=> #<Net::HTTPNotFound 404 Not Found readbody=true>
> resp.code
=> "404"
> resp.body
=> "<html>...</html>"
(If the library is not available to you by default, you can do a require 'net/http'
Works with HTTParty as well https://github.com/jnunemaker/httparty
require 'rubygems'
require 'httparty'
HTTParty.get("http://google.com/blahblah").parsed_response
There are a number of HTTP Clients available, choose one you like from https://www.ruby-toolbox.com/categories/http_clients

User-Agent in HTTP requests, Ruby

I'm pretty new to Ruby. I've tried looking over the online documentation, but I haven't found anything that quite works. I'd like to include a User-Agent in the following HTTP requests, bot get_response() and get(). Can someone point me in the right direction?
# Preliminary check that Proggit is up
check = Net::HTTP.get_response(URI.parse(proggit_url))
if check.code != "200"
puts "Error contacting Proggit"
return
end
# Attempt to get the json
response = Net::HTTP.get(URI.parse(proggit_url))
if response.nil?
puts "Bad response when fetching Proggit json"
return
end
Amir F is correct, that you may enjoy using another HTTP client like RestClient or Faraday, but if you wanted to stick with the standard Ruby library you could set your user agent like this:
url = URI.parse(proggit_url)
req = Net::HTTP::Get.new(proggit_url)
req.add_field('User-Agent', 'My User Agent Dawg')
res = Net::HTTP.start(url.host, url.port) {|http| http.request(req) }
res.body
Net::HTTP is very low level, I would recommend using the rest-client gem - it will also follows redirects automatically and be easier for you to work with, i.e:
require 'rest_client'
response = RestClient.get proggit_url
if response.code != 200
# do something
end

Check if URL exists in Ruby

How would I go about checking if a URL exists using Ruby?
For example, for the URL
https://google.com
the result should be truthy, but for the URLs
https://no.such.domain
or
https://stackoverflow.com/no/such/path
the result should be falsey
Use the Net::HTTP library.
require "net/http"
url = URI.parse("http://www.google.com/")
req = Net::HTTP.new(url.host, url.port)
res = req.request_head(url.path)
At this point res is a Net::HTTPResponse object containing the result of the request. You can then check the response code:
do_something_with_it(url) if res.code == "200"
Note: To check for https based url, use_ssl attribute should be true as:
require "net/http"
url = URI.parse("https://www.google.com/")
req = Net::HTTP.new(url.host, url.port)
req.use_ssl = true
res = req.request_head(url.path)
Sorry for the late reply on this, but I think this deserves a better answer.
There are three ways to look at this question:
Strict check if the URL exist
Check if you are requesting the URL correctly
Check if you can request it correctly and the server can answer it correctly
1. Strict check if the URL exist
While 200 means that the server answers to that URL (thus, the URL exists), answering other status code doesn't means that the URL does not exist. For example, answering 302 - redirected means that the URL exists and is redirecting to another one. While browsing, 302 many times behaves the same than 200 to the final user. Other status code that can be returned if a URL exists is 500 - internal server error. After all, if the URL does not exists, how it comes the application server processed your request instead return simply 404 - not found?
So there are actually only two cases when a URL does not exist: When the server does not exist or when the server exists but can't find the given URL path does not exist. Thus, the only way to check if the URL exists is checking if the server answers and the return code is not 404. The following code does just that.
require "net/http"
def url_exist?(url_string)
url = URI.parse(url_string)
req = Net::HTTP.new(url.host, url.port)
req.use_ssl = (url.scheme == 'https')
path = url.path if url.path.present?
res = req.request_head(path || '/')
res.code != "404" # false if returns 404 - not found
rescue Errno::ENOENT
false # false if can't find the server
end
2. Check if you are requesting the URL correctly
However, most of the times we are not interested in see if a URL exists, but if we can access it. Fortunately looking to the HTTP status codes families, that is the 4xx family, which states for client error (thus, an error in your side, which means you are not requesting the page correctly, don't have permission or whatsoever). This is a good of errors to check if you can access this page. From wiki:
The 4xx class of status code is intended for cases in which the client seems to have erred. Except when responding to a HEAD request, the server should include an entity containing an explanation of the error situation, and whether it is a temporary or permanent condition. These status codes are applicable to any request method. User agents should display any included entity to the user.
So the following code make sure the URL exists and you can access it:
require "net/http"
def url_exist?(url_string)
url = URI.parse(url_string)
req = Net::HTTP.new(url.host, url.port)
req.use_ssl = (url.scheme == 'https')
path = url.path if url.path.present?
res = req.request_head(path || '/')
if res.kind_of?(Net::HTTPRedirection)
url_exist?(res['location']) # Go after any redirect and make sure you can access the redirected URL
else
res.code[0] != "4" #false if http code starts with 4 - error on your side.
end
rescue Errno::ENOENT
false #false if can't find the server
end
3. Check if you can request it correctly and the server can answer it correctly
Just like the 4xx family checks if you can access the URL, the 5xx family checks if the server had any problem answering your request. An error on this family most of the times are due problems on the server itself, and hopefully they are working on solve it. If You need to be able to access the page and get a correct answer now, you should make sure the answer is not from 4xx or 5xx family, and if you was redirected, the redirected page answers correctly. So much similar to (2), you can simply use the following code:
require "net/http"
def url_exist?(url_string)
url = URI.parse(url_string)
req = Net::HTTP.new(url.host, url.port)
req.use_ssl = (url.scheme == 'https')
path = url.path if url.path.present?
res = req.request_head(path || '/')
if res.kind_of?(Net::HTTPRedirection)
url_exist?(res['location']) # Go after any redirect and make sure you can access the redirected URL
else
! %W(4 5).include?(res.code[0]) # Not from 4xx or 5xx families
end
rescue Errno::ENOENT
false #false if can't find the server
end
Net::HTTP works but if you can work outside stdlib, Faraday is better.
Faraday.head(the_url).status == 200
(200 is a success code, assuming that's what you meant by "exists".)
Simone's answer was very helpful to me.
Here is a version that returns true/false depending on URL validity, and which handles redirects:
require 'net/http'
require 'set'
def working_url?(url, max_redirects=6)
response = nil
seen = Set.new
loop do
url = URI.parse(url)
break if seen.include? url.to_s
break if seen.size > max_redirects
seen.add(url.to_s)
response = Net::HTTP.new(url.host, url.port).request_head(url.path)
if response.kind_of?(Net::HTTPRedirection)
url = response['location']
else
break
end
end
response.kind_of?(Net::HTTPSuccess) && url.to_s
end

Resources