Test existence of a page using Mechanize - Ruby

I want to test whether a URL exists before downloading it.
I usually do this:
agent = Mechanize.new
page = agent.get("http://www.some_url.com/atributes")
but instead of that, I want to check whether a page actually exists at that URL before downloading it.

The only way to see if a page exists (and that you can reach it over the internet) is to perform an actual request. You could first do an HTTP HEAD request, which fetches only the headers, not the actual content:
require 'mechanize'
url = "http://www.some_url.com/atributes"
agent = Mechanize.new
begin
  agent.head(url)
  page_exists = true
rescue SocketError, Mechanize::ResponseCodeError
  # SocketError: the host could not be resolved or reached.
  # Mechanize::ResponseCodeError: the server answered with an error status such as 404.
  page_exists = false
end
if page_exists
  page = agent.get(url)
  # do something with page ...
end
But then again, you can skip the extra request and rescue errors directly from the GET request:
require 'mechanize'
url = "http://www.some_url.com/atributes"
agent = Mechanize.new
begin
  page = agent.get(url)
  # do something with page ...
rescue SocketError, Mechanize::ResponseCodeError
  # Unreachable host, or an error status such as 404.
  puts "There is no such page."
end
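For comparison, the same HEAD check can be sketched with only Ruby's standard library. This is a minimal sketch and not the Mechanize approach from the answer: page_exists? is a hypothetical helper name, it assumes the URL includes its scheme (http:// or https://), and it treats every non-2xx status (redirects included) as "does not exist".

```ruby
require "net/http"
require "uri"

# Hypothetical helper: true when a HEAD request to +url+ gets a 2xx answer.
def page_exists?(url)
  uri = URI.parse(url)
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
    # HEAD asks for the headers only, so no body is downloaded.
    http.head(uri.request_uri).is_a?(Net::HTTPSuccess)
  end
rescue SocketError, SystemCallError, Timeout::Error
  # DNS failure, refused connection, or timeout: treat as "not reachable".
  false
end
```

A server that rejects HEAD requests (for example with 405) would be reported as missing by this sketch, so falling back to GET may be needed in practice.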

Related

Mechanize: Avoid webserver locking

I have this code:
#!/usr/bin/env ruby
# encoding: utf-8
require 'mechanize'
begin
  agent = Mechanize.new
  agent.robots = false
  agent.user_agent_alias = 'Mac Safari'
  url = "http://www.paris.cl/tienda/es/paris/computacion/tablet/tablet-acer--iconia-b1-710-l688-7-342750-ppp-"
  website = agent.get(url)
rescue StandardError => e
  puts "Error : " + e.message
end
This tries to fetch a web page, but I get this error:
Error : 403 => Net::HTTPForbidden for http://www.paris.cl/tienda/es/paris/computacion/tablet/tablet-acer--iconia-b1-710-l688-7-342750-ppp- -- unhandled response
The web server blocks me before I can get the page. I tried changing my IP, but nothing changed.
Is there any way to avoid this block? (I also don't know what kind of block this is.)
Greetings
Your code works for me. Apparently you hit their site several times as an unidentified client (without proper HTTP headers), and they blocked your IP.
It happens to the best of us :)

How do I follow URL redirection?

I have a URL and I need to retrieve the URL it redirects to (the number of redirections is arbitrary).
One real example I'm working on is:
https://www.google.com/url?q=http://m.zynga.com/about/privacy-center/privacy-policy&sa=D&usg=AFQjCNESJyXBeZenALhKWb52N1vHouAd5Q
which will eventually redirect to:
http://company.zynga.com/privacy/policy
which is the URL I'm interested in.
I tried with open-uri as follows:
privacy_url = "https://www.google.com/url?q=http://m.zynga.com/about/privacy-center/privacy-policy&sa=D&usg=AFQjCNESJyXBeZenALhKWb52N1vHouAd5Q"
final_url = nil
open(privacy_url) do |h|
  puts "Redirecting to #{h.base_uri}"
  final_url = h.base_uri
end
but I keep getting the original URL back, meaning that final_url is equal to privacy_url.
Is there any way to follow this kind of redirection and programmatically access the resulting URL?
I finally made it work using the Mechanize gem. The key is to enable the follow_meta_refresh option, which is disabled by default.
Here's how:
require 'mechanize'
browser = Mechanize.new
browser.follow_meta_refresh = true
start_url = "https://www.google.com/url?q=http://m.zynga.com/about/privacy-center/privacy-policy&sa=D&usg=AFQjCNESJyXBeZenALhKWb52N1vHouAd5Q"
final_url = nil
browser.get(start_url) do |page|
  final_url = page.uri.to_s
end
puts final_url # => http://company.zynga.com/privacy/policy

Detect redirect with ruby mechanize

I am using the mechanize/nokogiri gems to parse some random pages. I am having problems with 301/302 redirects. Here is a snippet of the code:
agent = Mechanize.new
page = agent.get('http://example.com/page1')
The test server redirects page1 to page2 with a 301/302 status code, so I was expecting:
page.code == "301"
Instead I always get page.code == "200".
My requirements are:
I want redirects to be followed (default mechanize behavior, which is good)
I want to be able to detect that page was actually redirected
I know that I can see the page1 in agent.history, but that's not reliable. I want the redirect status code also.
How can I achieve this behavior with mechanize?
You could turn automatic redirects off and keep following the Location header yourself:
agent.redirect_ok = false
page = agent.get 'http://www.google.com'
status_code = page.code # status of the first response (301/302 if it was a redirect)
while page.code[/30[12]/]
  page = agent.get page.header['location']
end
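One thing to watch when following the Location header by hand: its value may be relative, such as /page2. If you end up following it outside Mechanize (say with Net::HTTP), it has to be resolved against the URL that produced the redirect. A minimal sketch using only the standard library, where resolve_location is a hypothetical helper name:

```ruby
require "uri"

# Hypothetical helper: turn a Location header value into an absolute URL,
# resolving relative values against the URL that produced the redirect.
def resolve_location(current_url, location)
  URI.join(current_url, location).to_s
end

puts resolve_location("http://example.com/a/page1", "/page2")
puts resolve_location("http://example.com/a/page1", "https://example.org/x")
```

Absolute Location values pass through unchanged, so the helper can be applied to every redirect without checking first.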
I found a way to allow redirects and also get the status code, but I'm not sure it's the best method.
agent = Mechanize.new
# deactivate redirects first
agent.redirect_ok = false
status_code = '200'
error_occurred = false
# request url
begin
  page = agent.get(url)
  status_code = page.code
rescue Mechanize::ResponseCodeError => ex
  status_code = ex.response_code
  error_occurred = true
end
if !error_occurred && status_code != '200'
  # enable redirects and request the page again
  agent.redirect_ok = true
  page = agent.get(url)
end
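If the goal is both to follow redirects and to know which status codes were seen along the way, a variation on the manual approach is to record each hop. The sketch below uses Net::HTTP from the standard library rather than Mechanize; fetch_recording_redirects and its limit parameter are hypothetical names, and it assumes plain GET requests.

```ruby
require "net/http"
require "uri"

# Hypothetical helper (not a Mechanize API): follows redirects manually and
# returns [final_response, hops], where hops lists [status_code, url] pairs
# for every redirect that was followed.
def fetch_recording_redirects(url, limit = 10)
  hops = []
  uri = URI.parse(url)
  limit.times do
    response = Net::HTTP.get_response(uri)
    # Stop at the first response that is not a redirect with a Location header.
    unless response.is_a?(Net::HTTPRedirection) && response["location"]
      return [response, hops]
    end
    hops << [response.code, uri.to_s]
    # Location may be relative, so resolve it against the current URL.
    uri = URI.join(uri.to_s, response["location"])
  end
  raise ArgumentError, "too many HTTP redirects"
end
```

An empty hops array means the page was served directly; otherwise the first entry holds the status code (for example "301") of the original URL.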

Parse html GET via open() with nokogiri - redirect exception

I'm trying to learn Ruby, so I'm following a Google developer exercise. I'm trying to parse some links. When a link redirects (I know it can only be redirected once), I get a "redirection forbidden" error. I noticed that the redirect goes from an http link to an https link. Any concrete idea how I could implement this in Ruby, given that Google's exercise is written for Python?
Error:
ruby fix.rb
redirection forbidden: http://code.google.com/edu/languages/google-python-class/images/puzzle/p-bija-baei.jpg -> https://developers.google.com/edu/python/images/puzzle/p-bija-baei.jpg?csw=1
Code that should achieve what I'm looking for:
# urls: list of valid URLs (not checked); imgs: list of the images I'll download afterwards.
def acquireData(urls, imgs)
  begin
    urls.each do |url|
      page = Nokogiri::HTML(open(url))
      puts page.body
    end
  rescue Exception => e
    puts e
  end
end
Ruby's OpenURI will automatically handle redirects for you, as long as they're not "meta-refresh" that occur inside the HTML itself.
For instance, this follows a redirect automatically:
irb(main):008:0> page = open('http://www.example.org')
#<StringIO:0x00000002ae2de0>
irb(main):009:0> page.base_uri.to_s
"http://www.iana.org/domains/example"
In other words, the request to "www.example.org" got redirected to "www.iana.org" and OpenURI tracked it correctly.
If you are trying to learn HOW to handle redirects, read the Net::HTTP documentation. Here is the example from that documentation:
Following Redirection
Each Net::HTTPResponse object belongs to a class for its response code.
For example, all 2XX responses are instances of a Net::HTTPSuccess subclass, a 3XX response is an instance of a Net::HTTPRedirection subclass and a 200 response is an instance of the Net::HTTPOK class. For details of response classes, see the section “HTTP Response Classes” below.
Using a case statement you can handle various types of responses properly:
require 'net/http'
def fetch(uri_str, limit = 10)
  # You should choose a better exception.
  raise ArgumentError, 'too many HTTP redirects' if limit == 0
  response = Net::HTTP.get_response(URI(uri_str))
  case response
  when Net::HTTPSuccess then
    response
  when Net::HTTPRedirection then
    location = response['location']
    warn "redirected to #{location}"
    fetch(location, limit - 1)
  else
    response.value
  end
end
print fetch('http://www.ruby-lang.org')
If you want to handle meta-refresh statements, consider this:
require 'nokogiri'
doc = Nokogiri::HTML(%[<meta http-equiv="refresh" content="5;URL='http://example.com/'">])
meta_refresh = doc.at('meta[http-equiv="refresh"]')
if meta_refresh
  puts meta_refresh['content'][/URL=(.+)/, 1].gsub(/['"]/, '')
end
Which outputs:
http://example.com/
Basically, the code.google.com URL that you're trying to open redirects to an https URL. You can see that for yourself by pasting http://code.google.com/edu/languages/google-python-class/images/puzzle/p-bija-baei.jpg into your browser.
See the open-uri bug report that explains why it refuses to redirect from http to https.
So the solution to your problem is simply to use a different set of URLs (ones that don't redirect to https).

Ruby Net::HTTP - following 301 redirects

My users submit urls (to mixes on mixcloud.com) and my app uses them to perform web requests.
A good url returns a 200 status code:
uri = URI.parse("http://www.mixcloud.com/ErolAlkan/hard-summer-mix/")
request = Net::HTTP.get_response(uri)
#<Net::HTTPOK 200 OK readbody=true>
But if you forget the trailing slash then our otherwise good url returns a 301:
uri = "http://www.mixcloud.com/ErolAlkan/hard-summer-mix"
#<Net::HTTPMovedPermanently 301 MOVED PERMANENTLY readbody=true>
The same thing happens with 404's:
# bad path returns a 404
"http://www.mixcloud.com/bad/path/"
# bad path minus trailing slash returns a 301
"http://www.mixcloud.com/bad/path"
How can I 'drill down' into the 301 to see if it takes us on to a valid resource or an error page?
Is there a tool that provides a comprehensive overview of the rules that a particular domain might apply to their urls?
301 redirects are fairly common if you do not type the URL exactly as the web server expects it. They happen much more frequently than you'd think; you just don't normally notice them while browsing because the browser follows them automatically.
Two alternatives come to mind:
1: Use open-uri
open-uri handles redirects automatically. So all you'd need to do is:
require 'open-uri'
...
response = open('http://xyz...').read
If you have trouble redirecting between HTTP and HTTPS, then have a look here for a solution:
Ruby open-uri redirect forbidden
2: Handle redirects with Net::HTTP
def get_response_with_redirect(uri)
  r = Net::HTTP.get_response(uri)
  if r.code == "301"
    r = Net::HTTP.get_response(URI.parse(r['location']))
  end
  r
end
If you want to be even smarter, you could try adding or removing the trailing slash on the URL when you get a 404 response. You could do that in a method like get_response_smart, which handles this URL fiddling in addition to the redirects.
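A rough sketch of what such a get_response_smart could look like. The method name comes from the answer above, but the body, the toggle_trailing_slash helper and the limit parameter are assumptions made for illustration, and retrying with a modified URL is only sensible for GET requests.

```ruby
require "net/http"
require "uri"

# Hypothetical helper: returns a copy of +uri+ with the trailing slash of its
# path added or removed. The root path is left alone.
def toggle_trailing_slash(uri)
  return uri if uri.path.empty? || uri.path == "/"
  toggled = uri.dup
  toggled.path = uri.path.end_with?("/") ? uri.path.chomp("/") : "#{uri.path}/"
  toggled
end

# Follows up to +limit+ redirects, then retries once with the trailing slash
# toggled if the result is a 404. Returns the last response received.
def get_response_smart(uri, limit = 5)
  response = Net::HTTP.get_response(uri)
  limit.times do
    break unless response.is_a?(Net::HTTPRedirection) && response["location"]
    uri = URI.join(uri.to_s, response["location"])
    response = Net::HTTP.get_response(uri)
  end
  if response.is_a?(Net::HTTPNotFound)
    retried = Net::HTTP.get_response(toggle_trailing_slash(uri))
    # Keep the original 404 unless the retry actually succeeded.
    response = retried if retried.is_a?(Net::HTTPSuccess)
  end
  response
end
```

Called as get_response_smart(URI.parse(url)), it returns a Net::HTTPResponse whose code can be checked the same way as in get_response_with_redirect.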
A note on the accepted answer: r.header is now obsolete, so r.header['location'] should be replaced by r['location'] (per https://stackoverflow.com/a/6934503/1084675 )
rest-client follows redirections for GET and HEAD requests without any additional configuration. It works very well.
For result codes between 200 and 207, a RestClient::Response will be returned.
For result codes 301, 302 or 307, the redirection will be followed if the request is a GET or a HEAD.
For result code 303, the redirection will be followed and the request transformed into a GET.
Example usage:
require 'rest-client'
RestClient.get 'http://example.com/resource'
The rest-client README also gives an example of following redirects with POST requests:
begin
  RestClient.post('http://example.com/redirect', 'body')
rescue RestClient::MovedPermanently,
       RestClient::Found,
       RestClient::TemporaryRedirect => err
  err.response.follow_redirection
end
Here is the code I came up with (derived from different examples) which will bail out if there are too many redirects (note that ensure_success is optional):
require "net/http"
require "uri"
class Net::HTTPResponse
  def ensure_success
    unless kind_of? Net::HTTPSuccess
      warn "Request failed with HTTP #{@code}"
      each_header do |h, v|
        warn "#{h} => #{v}"
      end
      abort
    end
  end
end
def do_request(uri_string)
  response = nil
  tries = 0
  loop do
    uri = URI.parse(uri_string)
    http = Net::HTTP.new(uri.host, uri.port)
    http.use_ssl = (uri.scheme == "https") # needed when the URL or a redirect target is https
    request = Net::HTTP::Get.new(uri.request_uri)
    response = http.request(request)
    uri_string = response['location'] if response['location']
    unless response.kind_of? Net::HTTPRedirection
      response.ensure_success
      break
    end
    if tries == 10
      puts "Timing out after 10 tries"
      break
    end
    tries += 1
  end
  response
end
Not sure if anyone is looking for this exact solution, but if you are trying to download an image over http/https and store it in a variable, this works:
require 'open_uri_redirections'
require 'net/https'
# Note: VERIFY_NONE disables SSL certificate verification.
web_contents = open('file_url_goes_here', :ssl_verify_mode => OpenSSL::SSL::VERIFY_NONE, :allow_redirections => :all) {|f| f.read }
puts web_contents

Resources