Detect redirect with ruby mechanize - ruby

I am using the mechanize/nokogiri gems to parse some random pages. I am having problems with 301/302 redirects. Here is a snippet of the code:
agent = Mechanize.new
page = agent.get('http://example.com/page1')
The test server on example.com redirects page1 to page2 with a 301/302 status code, so I was expecting to have
page.code == "301"
Instead I always get page.code == "200".
My requirements are:
I want redirects to be followed (default mechanize behavior, which is good)
I want to be able to detect that page was actually redirected
I know that I can see the page1 in agent.history, but that's not reliable. I want the redirect status code also.
How can I achieve this behavior with mechanize?

You could turn automatic redirects off and keep following the Location header yourself:
agent.redirect_ok = false
page = agent.get 'http://www.google.com'
status_code = page.code
while page.code[/30[12]/]
page = agent.get page.header['location']
end
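
If you also want to know which hops were taken, the same loop can record each URL and status code as it goes. This is only a minimal sketch, not part of the Mechanize API: trace_redirects and its fetch argument are hypothetical names, and fetch stands for any callable that returns the status code and Location header for a URL. Relative Location values are resolved with URI.join.

```ruby
require 'uri'

# Follows 301/302 redirects by hand and records every hop.
# `fetch` is a hypothetical callable: it takes a URL string and returns
# [status_code_string, location_header_or_nil].
#
# With Mechanize (redirects disabled), `fetch` could be:
#   agent = Mechanize.new
#   agent.redirect_ok = false
#   fetch = ->(u) { page = agent.get(u); [page.code, page.header['location']] }
def trace_redirects(url, fetch, limit: 10)
  hops = []
  limit.times do
    code, location = fetch.call(url)
    hops << [url, code]
    # Stop at the first response that is not a 301/302 with a Location.
    return hops unless code =~ /\A30[12]\z/ && location
    # Location may be relative, so resolve it against the current URL.
    url = URI.join(url, location).to_s
  end
  raise "more than #{limit} redirects"
end
```

The last element of the returned array is the final page, and any earlier elements are the redirects that led to it, so a result with more than one element means the request was redirected.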

I found a way to allow redirects and also get the status code, but I'm not sure it's the best method.
agent = Mechanize.new
# deactivate redirects first
agent.redirect_ok = false
status_code = '200'
error_occurred = false
# request url
begin
page = agent.get(url)
status_code = page.code
rescue Mechanize::ResponseCodeError => ex
status_code = ex.response_code
error_occurred = true
end
if !error_occurred && status_code != '200' then
# enable redirects and request the page again
agent.redirect_ok = true
page = agent.get(url)
end

Related

How do I resolve an HTTP 500 error while web scraping with Mechanize in Ruby?

I want to retrieve my driving license number, issue_date, and expiry_date from this website ("https://sarathi.nic.in:8443/nrportal/sarathi/HomePage.jsp"). When I try to fetch it, I get the error Mechanize::ResponseCodeError: 500 => Net::HTTPInternalServerError for https://sarathi.nic.in:8443/nrportal/sarathi/DlDetRequest.jsp -- unhandled response.
This is the code that I wrote to scrape:
require 'mechanize'
require 'logger'
require 'nokogiri'
require 'open-uri'
require 'openssl'
OpenSSL::SSL::VERIFY_PEER = OpenSSL::SSL::VERIFY_NONE
agent = Mechanize.new
agent.log = Logger.new "mech.log"
agent.user_agent_alias = 'Mac Safari 4'
Mechanize.new.get("https://sarathi.nic.in:8443/nrportal/sarathi/HomePage.jsp")
page=agent.get('https://sarathi.nic.in:8443/nrportal/sarathi/HomePage.jsp') # opening home page.
page = agent.page.links.find { |l| l.text == 'Status of Licence' }.click # click the link.
page.forms_with(:name=>"dlform").first.field_with(:name=>"dlform:DLNumber").value="TN38 20120001119" #user input to text field.
page.form_with(:name=>"dlform").field_with(:name=>"javax.faces.ViewState").value="SUBMIT" #submit button value assigning.
page.form(:name=>"dlform",:action=>"/nrportal/sarathi/DlDetRequest.jsp") #to specify the form i need.
agent.cookie_jar.clear!
gg=agent.submit page.forms.last #submitting my form
It isn't working because you clear the cookies before submitting the form, which drops the session cookie the server presumably uses to track your request. I could get it working simply by removing that call:
...
page.forms_with(:name=>"dlform").first.field_with(:name=>"dlform:DLNumber").value="TN38 20120001119" #user input to text field
form = page.form(:name=>"dlform",:action=>"/nrportal/sarathi/DlDetRequest.jsp")
gg = agent.submit form, form.buttons.first
Note that you do not need to set a value for the submit button; instead, pass the button itself when submitting the form.

Test existence of a page using mechanize

I want to test if a URL exists before downloading it.
I usually do this:
agent=Mechanize.new
page=agent.get("http://www.some_url.com/atributes")
but instead of that I want to test whether a page exists at that URL before downloading it.
The only way to see if a page exists (and that you can reach it via the internet) is to perform an actual request. You could first do an HTTP HEAD request, which only requests the headers, not the actual content:
url = "http://www.some_url.com/atributes"
agent = Mechanize.new
begin
agent.head(url)
page_exists = true
rescue SocketError, Mechanize::ResponseCodeError
# SocketError: host not found; ResponseCodeError: e.g. a 404 response
page_exists = false
end
if page_exists
page = agent.get(url)
# do something with page ...
end
But then again, you can just get rid of the extra request and rescue from errors directly with the GET request:
url = "http://www.some_url.com/atributes"
agent = Mechanize.new
begin
page = agent.get(url)
# do something with page ...
rescue SocketError, Mechanize::ResponseCodeError
puts "There is no such page."
end
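
The begin/rescue pattern above can be wrapped in a small helper so the check reads as a single call. This is a hedged sketch: reachable? is a hypothetical name, the request itself is passed in as a block, and the list of error classes treated as "page does not exist" is an assumption you would adjust to your HTTP client.

```ruby
require 'socket' # defines SocketError

# Runs the given block (which should perform the request) and reports
# whether it completed without one of the listed errors.
# `errors` is an assumed default; extend it for your HTTP client.
def reachable?(url, errors: [SocketError, SystemCallError])
  yield url
  true
rescue *errors
  false
end
```

With Mechanize this might be called as reachable?(url, errors: [SocketError, Mechanize::ResponseCodeError]) { |u| agent.head(u) }, so that both an unknown host and an error status such as 404 count as "not there".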

In Mechanize (Ruby), how to login then scrape? [duplicate]

My aim: in Rails 3, get a PDF file from a site which requires you to log in before you can download it.
My method, using Mechanize:
Step 1: log in
Step 2: once logged in, get the PDF link
The problem is that when I debug and click on the scraped link, I'm redirected to the login page instead of getting the file.
These are the two checks I ran on step 1:
(...)
search_results = form.submit
puts search_results.body
=> {"succes":true,"URL":"/sso/inscription/"}
Apparently the login succeeded.
puts agent.cookie_jar.jar
=> I could find the information about my session, so I guess the cookies are saved
Any hint about what I did wrong?
(This could be important: on the site, when you log in at "http://elwatan.com/sso/inscription/inscription_payant.php", you are redirected to the home page, elwatan.com.)
Below my code:
# step 1, login:
agent = Mechanize.new
page = agent.get("http://elwatan.com/sso/inscription/inscription_payant.php")
form = page.form_with(:id => 'form-login-page')
form.login = "my_mail"
form.password = "my_pasword"
search_results = form.submit
# step 2, get the PDF:
@watan = {}
page.parser.xpath('//th/a').each do |link|
puts @watan[link.text.strip] = link['href']
end
The agent variable retains the session and cookies.
So you first do your login, as you did, and then you write agent.get(---your-pdf-link-here--).
Your example code has a small error: the result of the submit is stored in search_results, but you then keep using page to search for the links.
So in your case, I guess it should look like (untested of course) :
# step 1, login:
agent = Mechanize.new
agent.pluggable_parser.pdf = Mechanize::FileSaver
page = agent.get("http://elwatan.com/sso/inscription/inscription_payant.php")
form = page.form_with(:id => 'form-login-page')
form.login = "my_mail"
form.password = "my_pasword"
page = form.submit
# step 2, get the PDF:
page.parser.xpath('//th/a').each do |link|
agent.get link['href']
end
The page variable doesn't update itself after a submit, link click, etc.
You need to either work with the page returned by submit:
agent = Mechanize.new
page = agent.get("http://elwatan.com/sso/inscription/inscription_payant.php")
form = page.form_with(:id => 'form-login-page')
form.login = "my_mail"
form.password = "my_pasword"
page = form.submit
Or manually get a new page:
agent = Mechanize.new
page = agent.get("http://elwatan.com/sso/inscription/inscription_payant.php")
form = page.form_with(:id => 'form-login-page')
form.login = "my_mail"
form.password = "my_pasword"
form.submit
page2 = agent.get('http://...')

How do I follow URL redirection?

I have a URL and I need to retrieve the URL it redirects to (the number of redirections is arbitrary).
One real example I'm working on is:
https://www.google.com/url?q=http://m.zynga.com/about/privacy-center/privacy-policy&sa=D&usg=AFQjCNESJyXBeZenALhKWb52N1vHouAd5Q
which will eventually redirect to:
http://company.zynga.com/privacy/policy
which is the URL I'm interested in.
I tried with open-uri as follows:
privacy_url = "https://www.google.com/url?q=http://m.zynga.com/about/privacy-center/privacy-policy&sa=D&usg=AFQjCNESJyXBeZenALhKWb52N1vHouAd5Q"
final_url = nil
open(privacy_url) do |h|
puts "Redirecting to #{h.base_uri}"
final_url = h.base_uri
end
but I keep getting the original URL back, meaning that final_url is equal to privacy_url.
Is there any way to follow this kind of redirection and programmatically access the resulting URL?
I finally got it working using the Mechanize gem. The key is to enable the follow_meta_refresh option, which is disabled by default.
Here's how:
require 'mechanize'
browser = Mechanize.new
browser.follow_meta_refresh = true
start_url = "https://www.google.com/url?q=http://m.zynga.com/about/privacy-center/privacy-policy&sa=D&usg=AFQjCNESJyXBeZenALhKWb52N1vHouAd5Q"
final_url = nil
browser.get(start_url) do |page|
final_url = page.uri.to_s
end
puts final_url # => http://company.zynga.com/privacy/policy
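
The follow_meta_refresh option matters because this kind of redirect is written into the HTML of the page as a meta refresh tag rather than sent as an HTTP 3xx status, so a client that only follows HTTP redirects stops at the first page. To illustrate what such a tag contains, here is a rough sketch that pulls the target URL out of a page body with regular expressions. meta_refresh_target is a hypothetical helper, not part of Mechanize, and a real HTML parser would be more robust than these patterns.

```ruby
# Returns the target URL of a <meta http-equiv="refresh"> tag, or nil
# when the HTML contains no such tag. Regex-based and deliberately
# simple; assumes the content attribute is quoted.
def meta_refresh_target(html)
  tag = html[/<meta[^>]*http-equiv\s*=\s*["']?refresh["']?[^>]*>/i]
  return nil unless tag

  content = tag[/content\s*=\s*"([^"]*)"/i, 1] ||
            tag[/content\s*=\s*'([^']*)'/i, 1]
  return nil unless content

  # content looks like "0;url=http://..." (the URL may itself be quoted).
  content[/url\s*=\s*['"]?([^'"]+)/i, 1]&.strip
end
```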

How to get redirect log in Mechanize?

In Ruby, if you use Mechanize and it follows 301/302 redirects like this
require 'mechanize'
m = WWW::Mechanize.new
m.get('http://google.com')
how do you get the list of pages Mechanize was redirected through? (Like http://google.com => http://www.google.com => http://google.com.ua)
OK, here is the code in mechanize responsible for redirection
elsif res_klass <= Net::HTTPRedirection
return page unless follow_redirect?
log.info("follow redirect to: #{ response['Location'] }") if log
from_uri = page.uri
raise RedirectLimitReachedError.new(page, redirects) if redirects + 1 > redirection_limit
redirect_verb = options[:verb] == :head ? :head : :get
page = fetch_page( :uri => response['Location'].to_s,
:referer => page,
:params => [],
:verb => redirect_verb,
:redirects => redirects + 1
)
@history.push(page, from_uri)
return page
but m.history.map {|p| puts p.uri} prints the URI of the last page three times.
The key here is to take advantage of the logging built into Mechanize. Here's a full code sample using Ruby's standard Logger class.
require 'mechanize'
require 'logger'
mechanize_logger = Logger.new('log/mechanize.log')
mechanize_logger.level = Logger::INFO
url = 'http://google.com'
agent = Mechanize.new
agent.log = mechanize_logger
agent.get(url)
And then check the output of log/mechanize.log in your log directory and you'll see the whole mechanize process including the intermediate urls.
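
If you would rather have the intermediate URLs as a Ruby array than read them from a file, the logger can write to a StringIO and the log text can be scanned afterwards. This sketch relies on the "follow redirect to: ..." message shown in the Mechanize source quoted in the question; that wording is an assumption about your Mechanize version, and redirect_targets is a hypothetical helper name.

```ruby
require 'logger'
require 'stringio'

# Extracts the redirect targets from Mechanize log output.
# Assumes the log contains lines with "follow redirect to: <url>".
def redirect_targets(log_text)
  log_text.scan(/follow redirect to: (\S+)/).flatten
end

# Possible usage with Mechanize (not executed here):
#   io = StringIO.new
#   agent = Mechanize.new
#   agent.log = Logger.new(io)
#   agent.get('http://google.com')
#   redirect_targets(io.string)
```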
I'm not certain, but here are a couple of things to try:
see what's in m.history[i].uri after the get()
You might need something like:
for m.redirection_limit in 0..99
begin
m.get(url)
break
rescue WWW::Mechanize::RedirectLimitReachedError
# code here could get control at
# intermediate redirection levels
end
end
