Ruby - Validate and update URL

I've been trying to change this method so that, instead of following redirects and returning the contents of the URL, it returns the new valid URL.
After reading up on the Net::HTTP object, I'm still not sure how exactly the get_response method works. Is this what's downloading the page? Is there another method I could call that would just ping the URL instead of downloading it?
require 'net/http'
def validate(url)
  uri = URI.parse(url)
  response = Net::HTTP.get_response(uri)
  case response
  when Net::HTTPSuccess
    return response
  when Net::HTTPRedirection
    return validate(response['location'])
  else
    return nil
  end
end
puts validate('http://somesite.com/somedir/mypage.html')

You are correct that get_response sends an HTTP GET request to the server, which requests the whole page.
You want to use a HEAD request instead of GET. This requests the same HTTP response header that a GET request would get, including the status code (200, 404, etc.), but without downloading the whole page.
See the request_head and head methods of Net::HTTP. For example:
url = URI.parse('http://www.ruby-doc.org/stdlib/libdoc/net/http/rdoc/index.html')
res = Net::HTTP.start(url.host, url.port) {|http|
  http.head(url.path)
}
puts res.class
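Putting that together with the validate method from the question, here is a minimal sketch of a version that returns the final URL rather than the page. It assumes a HEAD request is enough to validate the URL. The head_request helper and the limit and requester parameters are my own additions (they are not part of Net::HTTP); they are there so that redirect loops stop and so the network call can be swapped out.

```ruby
require 'net/http'
require 'uri'

# Sends a HEAD request and returns the Net::HTTPResponse (headers only, no body).
# Hypothetical helper, not part of Net::HTTP.
def head_request(uri)
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.head(uri.request_uri)
  end
end

# Returns the final URL as a String, or nil if it does not resolve to a 2xx
# response within `limit` hops. `requester` is an assumed injection point for
# anything that responds to call(uri) and returns a Net::HTTPResponse.
def validate(url, limit = 5, requester = method(:head_request))
  return nil if limit.zero?

  response = requester.call(URI.parse(url))
  case response
  when Net::HTTPSuccess
    url
  when Net::HTTPRedirection
    location = response['location']
    return nil unless location

    # URI.join also copes with relative Location headers such as "/new".
    validate(URI.join(url, location).to_s, limit - 1, requester)
  end
end
```

Calling validate('http://somesite.com/somedir/mypage.html') would then give back either the URL that finally answered with a 2xx status or nil. Some servers do not handle HEAD well, so falling back to GET may be needed in practice.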

Do you mean, by 'ping the url', you want to know whether the url request returns an HTTP 200 response?
I haven't looked at the implementation of get_response, but I think it just sends out an HTTP GET request, by the looks of it.
If you want to check for the HTTP 200 response, I guess you could just keep doing get_response until you get HTTPSuccess && HTTPOK.

Related

How do I get a redirected status code using Net::HTTP?

Similar to "getting the status code of a HTTP redirected page", but with Net::HTTP instead of curb. I am making a GET request to a page that will redirect:
response = Net::HTTP.get_response(URI.parse("http://www.wikipedia.org/wiki/URL_redirection"))
puts response.code
puts response['location']
=> 301
en.wikipedia.org/wiki/URL_redirection
The problem is that I want to know the status code of the redirected page. In this case it is 200, but in my app I want to check if it is 200 or something else.
The solution I've seen is to just call get_response(response['location']), but that won't work in my application because the way the redirect is designed makes it so that the redirect can only be followed once. Since the first GET consumes that one redirect, I can't then follow it again.
Is there some way to get the last status code that is a result of a GET?
EDIT: Further clarification of the situation:
The application that I'm sending GET to has a single sign-on authentication mechanism where, if I want to access 'myapp/mypage', I have to first send a post:
postResponse = Net::HTTP.post_form(URI.parse("http://myapp.com/trusted"), {"username" => @username})
Then make the GET request to:
"http://myapp.com/trusted/#{postResponse.body}/mypage"
*The postResponse.body is a 'ticket' which can be redeemed once.
That GET verifies that the ticket is valid and then redirects to:
myapp.com/mypage
So whether that ticket is valid or not, I get a 301.
I want to check the status code of the final get to myapp.com/mypage.
If I manually try to follow the redirect, whether it's a HEAD request or a GET, the original redirect will have already consumed the ticket, so I will get an error that the ticket is expired even if the original redirect was a 200.
The Net::HTTP documentation has example code showing how to deal with redirects. Have you tried it? It should make it easy to get inside the redirect mechanism and grab statuses for later.
Here's their example:
Following Redirection
Each Net::HTTPResponse object belongs to a class for its response code.
For example, all 2XX responses are instances of a Net::HTTPSuccess subclass, a 3XX response is an instance of a Net::HTTPRedirection subclass and a 200 response is an instance of the Net::HTTPOK class. For details of response classes, see the section “HTTP Response Classes” below.
Using a case statement you can handle various types of responses properly:
def fetch(uri_str, limit = 10)
# You should choose a better exception.
raise ArgumentError, 'too many HTTP redirects' if limit == 0
response = Net::HTTP.get_response(URI(uri_str))
case response
when Net::HTTPSuccess then
response
when Net::HTTPRedirection then
location = response['location']
warn "redirected to #{location}"
fetch(location, limit - 1)
else
response.value
end
end
print fetch('http://www.ruby-lang.org')
A minor change like this should help:
require 'net/http'
RESPONSES = []
def fetch(uri_str, limit = 10)
# You should choose a better exception.
raise ArgumentError, 'too many HTTP redirects' if limit == 0
response = Net::HTTP.get_response(URI(uri_str))
RESPONSES << response
case response
when Net::HTTPSuccess then
response
when Net::HTTPRedirection then
location = response['location']
warn "redirected to #{location}"
fetch(location, limit - 1)
else
response.value
end
end
print fetch('http://jigsaw.w3.org/HTTP/300/302.html')
puts RESPONSES.join("\n") # =>
I see this when I run it:
redirected to http://jigsaw.w3.org/HTTP/300/Overview.html
#<Net::HTTPOK:0x007f9e82a1e050>#<Net::HTTPFound:0x007f9e82a2daa0>
#<Net::HTTPOK:0x007f9e82a1e050>
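If you would rather not keep a global constant around, the same idea can be written so that the history is the return value. This is only a sketch: redirect_trail and its getter parameter are names I made up, and getter defaults to Net::HTTP.get_response. Each URL is requested exactly once and its status is recorded as it goes, which seems to be what the one-time-ticket case needs.

```ruby
require 'net/http'
require 'uri'

# Follows redirects starting at `url` and returns [[url, status_code], ...],
# one entry per hop, stopping after `limit` requests.
# `getter` is a hypothetical hook: anything responding to call(uri) that
# returns a Net::HTTPResponse.
def redirect_trail(url, limit = 10, getter = Net::HTTP.method(:get_response))
  trail = []
  limit.times do
    response = getter.call(URI(url))
    trail << [url, response.code]
    break unless response.is_a?(Net::HTTPRedirection) && response['location']

    # Resolve relative Location headers against the current URL.
    url = URI.join(url, response['location']).to_s
  end
  trail
end
```

The last element of redirect_trail('http://jigsaw.w3.org/HTTP/300/302.html') would then hold the status code of the final page, and the earlier elements hold the redirects that led there.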
If it's enough just to make an HTTP HEAD request without 'consuming' your URL (this would be the usual expectation for a HEAD request), you can do it like this:
2.0.0-p195 :143 > result = Net::HTTP.start('www.google.com') { |http| http.head '/' }
=> #<Net::HTTPFound 302 Found readbody=true>
So in your example you'd do this:
...
result = Net::HTTP.start(response.uri.host) { |http| http.head response.uri.path }
If you want to preserve a history of response codes, you could try this. This retains the last 5 response codes from calls to get_response and exposes them through a Net::HTTP.history method.
module Net
  class << HTTP
    alias_method :_get_response, :get_response
    def get_response *args, &block
      resp = _get_response *args, &block
      @history = (@history || []).push(resp.code).last 5
      resp
    end
    def history
      @history || []
    end
  end
end
(I don't entirely get the usage scenario, so adapt to your needs)

Net::HTTP returning 404 when I know it's 301

I've got a piece of Ruby code that I've written to follow a series of potential redirects until it reaches the final URL:
def self.obtain_final_url_in_chain url
logger.debug "Following '#{url}'"
uri = URI url
http = Net::HTTP.start uri.host, uri.port
response = http.request_head url
case response.code
when "301"
obtain_final_url_in_chain response['location']
when "302"
obtain_final_url_in_chain response['location']
else
url
end
end
You call obtain_final_url_in_chain with the url and it should eventually return the final url.
I'm trying it with this URL: http://feeds.5by5.tv/master
Based on http://web-sniffer.net/ this should be redirected to http://5by5.tv/rss as a result of a 301 redirect. Instead though I get a 404 for http://feeds.5by5.tv/master.
The above code is returning 200 for other URLs though (eg. http://feeds.feedburner.com/5by5video).
Does anyone know why this is happening please? It's driving me nuts!
Thanks.
According to the docs for Net::HTTP#request_head, you want to pass the path, not the full url, as the first parameter.
With that and a few other changes, here's one way to rewrite your method:
def obtain_final_url_in_chain(url)
  uri = URI url
  response = Net::HTTP.start(uri.host, uri.port) do |http|
    http.request_head uri.path
  end
  case response
  when Net::HTTPRedirection
    obtain_final_url_in_chain response['location']
  else
    url
  end
end
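To see which piece of the URL goes where, it can help to print what URI gives you. A small sketch follows; request_parts is a made-up helper and the ?format=xml query string is only there for illustration. Note that uri.path drops any query string, so uri.request_uri is usually the safer thing to hand to request_head.

```ruby
require 'uri'

# Splits a URL into the pieces Net::HTTP expects in different places:
# host and port go to Net::HTTP.start, the path (or request_uri) goes to
# request_head / head / get.
def request_parts(url)
  uri = URI(url)
  { host: uri.host, port: uri.port, path: uri.path, request_uri: uri.request_uri }
end

parts = request_parts('http://feeds.5by5.tv/master?format=xml')
puts parts[:host]         # feeds.5by5.tv
puts parts[:path]         # /master
puts parts[:request_uri]  # /master?format=xml
```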

User-Agent in HTTP requests, Ruby

I'm pretty new to Ruby. I've tried looking over the online documentation, but I haven't found anything that quite works. I'd like to include a User-Agent in the following HTTP requests, both get_response() and get(). Can someone point me in the right direction?
# Preliminary check that Proggit is up
check = Net::HTTP.get_response(URI.parse(proggit_url))
if check.code != "200"
puts "Error contacting Proggit"
return
end
# Attempt to get the json
response = Net::HTTP.get(URI.parse(proggit_url))
if response.nil?
puts "Bad response when fetching Proggit json"
return
end
Amir F is correct that you may prefer another HTTP client such as RestClient or Faraday, but if you want to stick with the standard Ruby library you can set your user agent like this:
url = URI.parse(proggit_url)
req = Net::HTTP::Get.new(url.request_uri)
# []= replaces the default User-Agent; add_field would append to it instead
req['User-Agent'] = 'My User Agent Dawg'
res = Net::HTTP.start(url.host, url.port) {|http| http.request(req) }
res.body
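If both of your calls need the header, one option is to wrap this in a small helper. The following is only a sketch: build_get and get_with_user_agent are names I invented, and the reddit URL and agent string are stand-ins for your proggit_url and whatever agent you want to send.

```ruby
require 'net/http'
require 'uri'

# Builds (but does not send) a GET request carrying a custom User-Agent.
def build_get(url, user_agent)
  uri = URI.parse(url)
  Net::HTTP::Get.new(uri.request_uri, { 'User-Agent' => user_agent })
end

# Sends the request and returns the Net::HTTPResponse, so the caller can
# check both response.code and response.body from a single request.
def get_with_user_agent(url, user_agent)
  uri = URI.parse(url)
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.request(build_get(url, user_agent))
  end
end

# Hypothetical URL and agent string, used here only to inspect the request.
req = build_get('http://www.reddit.com/r/programming.json', 'MyProggitReader/0.1')
puts req['User-Agent']  # MyProggitReader/0.1
```

Since the response object carries both the status code and the body, a single call to get_with_user_agent could replace the separate get_response and get calls in the question.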
Net::HTTP is very low level. I would recommend using the rest-client gem: it also follows redirects automatically and is easier to work with, e.g.:
require 'rest_client'
response = RestClient.get proggit_url
if response.code != 200
# do something
end

Ruby Net::HTTP - following 301 redirects

My users submit urls (to mixes on mixcloud.com) and my app uses them to perform web requests.
A good url returns a 200 status code:
uri = URI.parse("http://www.mixcloud.com/ErolAlkan/hard-summer-mix/")
request = Net::HTTP.get_response(uri)
#<Net::HTTPOK 200 OK readbody=true>
But if you forget the trailing slash then our otherwise good url returns a 301:
uri = "http://www.mixcloud.com/ErolAlkan/hard-summer-mix"
#<Net::HTTPMovedPermanently 301 MOVED PERMANENTLY readbody=true>
The same thing happens with 404's:
# bad path returns a 404
"http://www.mixcloud.com/bad/path/"
# bad path minus trailing slash returns a 301
"http://www.mixcloud.com/bad/path"
How can I 'drill down' into the 301 to see if it takes us on to a valid resource or an error page?
Is there a tool that provides a comprehensive overview of the rules that a particular domain might apply to their urls?
301 redirects are fairly common if you do not type the URL exactly as the web server expects it. They happen much more often than you'd think; you just don't normally notice them while browsing because the browser follows them automatically for you.
Two alternatives come to mind:
1: Use open-uri
open-uri handles redirects automatically. So all you'd need to do is:
require 'open-uri'
...
# On Ruby 2.5 and later, call URI.open; Kernel#open no longer accepts URLs as of Ruby 3.0.
response = URI.open('http://xyz...').read
If you have trouble redirecting between HTTP and HTTPS, then have a look here for a solution:
Ruby open-uri redirect forbidden
2: Handle redirects with Net::HTTP
def get_response_with_redirect(uri)
r = Net::HTTP.get_response(uri)
if r.code == "301"
r = Net::HTTP.get_response(URI.parse(r['location']))
end
r
end
If you want to be even smarter, you could try adding or removing the trailing slash on the URL when you get a 404 response. You could do that by creating a method like get_response_smart which handles this URL fiddling in addition to the redirects.
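A rough sketch of that idea is below. Treat it as an illustration only: toggle_trailing_slash is a helper I made up, the getter parameter is an assumed hook that defaults to Net::HTTP.get_response, and this version only does the slash retry on a 404 (redirect handling would still have to be layered on top).

```ruby
require 'net/http'
require 'uri'

# Adds a trailing slash to the URL's path if it has none, or removes it
# if it has one. The query string, if any, is left untouched.
def toggle_trailing_slash(url)
  uri = URI.parse(url)
  path = uri.path.empty? ? '/' : uri.path
  uri.path = path.end_with?('/') ? path.chomp('/') : "#{path}/"
  uri.to_s
end

# On a 404, retries once with the trailing slash toggled.
# `getter` is a hypothetical hook: anything responding to call(uri).
def get_response_smart(url, getter = Net::HTTP.method(:get_response))
  response = getter.call(URI.parse(url))
  return response unless response.is_a?(Net::HTTPNotFound)

  getter.call(URI.parse(toggle_trailing_slash(url)))
end

puts toggle_trailing_slash('http://www.mixcloud.com/bad/path')  # http://www.mixcloud.com/bad/path/
```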
I can't figure out how to comment on the accepted answer (this question might be closed), but I should note that r.header is now obsolete, so r.header['location'] should be replaced by r['location'] (per https://stackoverflow.com/a/6934503/1084675 )
rest-client follows redirects for GET and HEAD requests without any additional configuration. It works very well.
for result codes between 200 and 207, a RestClient::Response will be returned
for result codes 301, 302 or 307, the redirection will be followed if the request is a GET or a HEAD
for result code 303, the redirection will be followed and the request transformed into a GET
example of usage:
require 'rest-client'
RestClient.get 'http://example.com/resource'
The rest-client README also gives an example of following redirects with POST requests:
begin
RestClient.post('http://example.com/redirect', 'body')
rescue RestClient::MovedPermanently,
RestClient::Found,
RestClient::TemporaryRedirect => err
err.response.follow_redirection
end
Here is the code I came up with (derived from different examples) which will bail out if there are too many redirects (note that ensure_success is optional):
require "net/http"
require "uri"
class Net::HTTPResponse
def ensure_success
unless kind_of? Net::HTTPSuccess
warn "Request failed with HTTP #{@code}"
each_header do |h,v|
warn "#{h} => #{v}"
end
abort
end
end
end
def do_request(uri_string)
response = nil
tries = 0
loop do
uri = URI.parse(uri_string)
http = Net::HTTP.new(uri.host, uri.port)
request = Net::HTTP::Get.new(uri.request_uri)
response = http.request(request)
uri_string = response['location'] if response['location']
unless response.kind_of? Net::HTTPRedirection
response.ensure_success
break
end
if tries == 10
puts "Timing out after 10 tries"
break
end
tries += 1
end
response
end
Not sure if anyone is looking for this exact solution, but if you are trying to download an image over HTTP or HTTPS and store it in a variable, this works (it needs the open_uri_redirections gem):
require 'open_uri_redirections'
require 'net/https'
web_contents = open('file_url_goes_here', :ssl_verify_mode => OpenSSL::SSL::VERIFY_NONE, :allow_redirections => :all) {|f| f.read }
puts web_contents

Accessing Headers for Net::HTTP::Post in ruby

I have the following bit of code:
uri = URI.parse("https://rs.xxx-travel.com/wbsapi/RequestListenerServlet")
https = Net::HTTP.new(uri.host,uri.port)
https.use_ssl = true
req = Net::HTTP::Post.new(uri.path)
req.body = searchxml
req["Accept-Encoding"] ='gzip'
res = https.request(req)
This normally works fine but the server at the other side is complaining about something in my XML and the techies there need the xml message AND the headers that are being sent.
I've got the xml message, but I can't work out how to get at the Headers that are being sent with the above.
To access headers use the each_header method:
# Header being sent (the request object):
req.each_header do |header_name, header_value|
puts "#{header_name} : #{header_value}"
end
# Works with the response object as well:
res.each_header do |header_name, header_value|
puts "#{header_name} : #{header_value}"
end
You can add:
https.set_debug_output $stderr
before the request, and you will see in the console the actual HTTP request sent to the server.
This is very useful for debugging this kind of scenario.
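If you need the headers as text to paste into an email rather than on the console, something along these lines might do. It is a sketch only: dump_request is a hypothetical helper and example.com stands in for the real host. It shows the headers set on the request object; headers that Net::HTTP fills in at send time (Host, for instance) may not appear here, so set_debug_output remains the more complete view.

```ruby
require 'net/http'
require 'uri'

# Renders a request's method, path, headers and body as one String.
# Header names come back lower-cased from each_header.
def dump_request(req)
  lines = ["#{req.method} #{req.path}"]
  req.each_header { |name, value| lines << "#{name}: #{value}" }
  lines << '' << req.body.to_s
  lines.join("\n")
end

uri = URI.parse('https://example.com/wbsapi/RequestListenerServlet')  # hypothetical host
req = Net::HTTP::Post.new(uri.path)
req.body = '<xml><blah></blah></xml>'
req['Accept-Encoding'] = 'gzip'

puts dump_request(req)
```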
Take a look at the docs for Net::HTTP's post method. It takes the path of the URI, the data (XML) you want to post, then the headers you want to set. On current Ruby versions it returns the response object, and the body is available through res.body; the older idiom of destructuring the result into a response and a body only worked on early Ruby versions.
I can't test this because you've obscured the host, and odds are good it takes a registered account, but the code looks correct from what I remember when using Net::HTTP.
require 'net/http'
require 'uri'
uri = URI.parse("https://rs.xxx-travel.com/wbsapi/RequestListenerServlet")
https = Net::HTTP.new(uri.host, uri.port)
https.use_ssl = true
res = https.post(uri.path, '<xml><blah></blah></xml>', {"Accept-Encoding" => 'gzip'})
puts "#{res.body.size} bytes received."
res.each_header { |h, v| puts "#{h}: #{v}" }
Look at Typhoeus as an alternate, and, in my opinion, easier to use gem, especially the "Making Quick Requests" section.
