How to get redirect log in Mechanize? - ruby

In Ruby, if you use Mechanize and it follows 301/302 redirects, like this:
require 'mechanize'
m = WWW::Mechanize.new
m.get('http://google.com')
how do you get the list of pages Mechanize was redirected through? (For example, http://google.com => http://www.google.com => http://google.com.ua)
OK, here is the code in mechanize responsible for redirection
elsif res_klass <= Net::HTTPRedirection
  return page unless follow_redirect?
  log.info("follow redirect to: #{ response['Location'] }") if log
  from_uri = page.uri
  raise RedirectLimitReachedError.new(page, redirects) if redirects + 1 > redirection_limit
  redirect_verb = options[:verb] == :head ? :head : :get
  page = fetch_page( :uri => response['Location'].to_s,
                     :referer => page,
                     :params => [],
                     :verb => redirect_verb,
                     :redirects => redirects + 1
                   )
  #history.push(page, from_uri)
  return page
but calling m.history.map {|p| puts p.uri} prints the URI of the last page three times.

The key here is to take advantage of the built-in logging in Mechanize. Here's a full code sample using Ruby's standard Logger class.
require 'mechanize'
require 'logger'
mechanize_logger = Logger.new('log/mechanize.log')
mechanize_logger.level = Logger::INFO
url = 'http://google.com'
agent = Mechanize.new
agent.log = mechanize_logger
agent.get(url)
Then check log/mechanize.log and you'll see the whole Mechanize process, including the intermediate URLs.
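If you want the redirect targets as a list rather than reading the log file by eye, one option is to log into memory and scan for the redirect lines. This is a minimal sketch, assuming the log message has the "follow redirect to: ..." form shown in the Mechanize source quoted in the question (other versions may word it differently). The redirect_targets helper is hypothetical, and the logger calls below only stand in for a real agent run so the example works offline.

```ruby
require 'logger'
require 'stringio'

# Hypothetical helper: pull redirect targets out of Mechanize log text.
# Assumes lines of the form "follow redirect to: <url>".
def redirect_targets(log_text)
  log_text.scan(/follow redirect to: (\S+)/).flatten
end

# Stand-in for a real run: write the same kind of lines into an in-memory log.
buffer = StringIO.new
logger = Logger.new(buffer)
logger.level = Logger::INFO
logger.info('follow redirect to: http://www.google.com/')
logger.info('follow redirect to: http://www.google.com.ua/')
logger.debug('this line is dropped at INFO level')

chain = redirect_targets(buffer.string)
puts chain
```

With a real agent you would assign the same in-memory logger through agent.log before calling agent.get, then pass buffer.string to the helper.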

I'm not certain, but here are a couple of things to try:
See what's in m.history[i].uri after the get().
You might need something like:
(0..99).each do |limit|
  m.redirection_limit = limit
  begin
    m.get(url)
    break
  rescue WWW::Mechanize::RedirectLimitReachedError
    # code here could get control at
    # intermediate redirection levels
  end
end

Related

How do i resolve an HTTP500 Error while web scraping with Mechanize in ruby?

I want to retrieve my driving license number, issue_date, and expiry_date from this website("https://sarathi.nic.in:8443/nrportal/sarathi/HomePage.jsp"). When I try to fetch it, I get the error Mechanize::ResponseCodeError: 500 => Net::HTTPInternalServerError for https://sarathi.nic.in:8443/nrportal/sarathi/DlDetRequest.jsp -- unhandled response.
This is the code that I wrote to scrape:
require 'mechanize'
require 'logger'
require 'nokogiri'
require 'open-uri'
require 'openssl'
OpenSSL::SSL::VERIFY_PEER = OpenSSL::SSL::VERIFY_NONE
agent = Mechanize.new
agent.log = Logger.new "mech.log"
agent.user_agent_alias = 'Mac Safari 4'
Mechanize.new.get("https://sarathi.nic.in:8443/nrportal/sarathi/HomePage.jsp")
page=agent.get('https://sarathi.nic.in:8443/nrportal/sarathi/HomePage.jsp') # opening home page.
page = agent.page.links.find { |l| l.text == 'Status of Licence' }.click # click the link.
page.forms_with(:name=>"dlform").first.field_with(:name=>"dlform:DLNumber").value="TN38 20120001119" #user input to text field.
page.form_with(:name=>"dlform").field_with(:name=>"javax.faces.ViewState").value="SUBMIT" #submit button value assigning.
page.form(:name=>"dlform",:action=>"/nrportal/sarathi/DlDetRequest.jsp") #to specify the form i need.
agent.cookie_jar.clear!
gg=agent.submit page.forms.last #submitting my form
It isn't working because you clear the cookies before submitting the form, which throws away the session cookies the site set on the earlier requests. I got it working simply by removing that call:
...
page.forms_with(:name=>"dlform").first.field_with(:name=>"dlform:DLNumber").value="TN38 20120001119" #user input to text field
form = page.form(:name=>"dlform",:action=>"/nrportal/sarathi/DlDetRequest.jsp")
gg = agent.submit form, form.buttons.first
Note that you do not need to set a value for the submit button; instead, pass the button to agent.submit along with the form.

How do I follow URL redirection?

I have a URL and I need to retrieve the URL it redirects to (the number of redirections is arbitrary).
One real example I'm working on is:
https://www.google.com/url?q=http://m.zynga.com/about/privacy-center/privacy-policy&sa=D&usg=AFQjCNESJyXBeZenALhKWb52N1vHouAd5Q
which will eventually redirect to:
http://company.zynga.com/privacy/policy
which is the URL I'm interested in.
I tried with open-uri as follows:
privacy_url = "https://www.google.com/url?q=http://m.zynga.com/about/privacy-center/privacy-policy&sa=D&usg=AFQjCNESJyXBeZenALhKWb52N1vHouAd5Q"
final_url = nil
open(privacy_url) do |h|
  puts "Redirecting to #{h.base_uri}"
  final_url = h.base_uri
end
but I keep getting the original URL back, meaning that final_url is equal to privacy_url.
Is there any way to follow this kind of redirection and programmatically access the resulting URL?
I finally made it work using the Mechanize gem. The key is to enable the follow_meta_refresh option, which is disabled by default.
Here's how:
require 'mechanize'
browser = Mechanize.new
browser.follow_meta_refresh = true
start_url = "https://www.google.com/url?q=http://m.zynga.com/about/privacy-center/privacy-policy&sa=D&usg=AFQjCNESJyXBeZenALhKWb52N1vHouAd5Q"
final_url = nil
browser.get(start_url) do |page|
  final_url = page.uri.to_s
end
puts final_url # => http://company.zynga.com/privacy/policy
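For context, open-uri only follows HTTP-level redirects, so a page that redirects through an HTML meta refresh tag comes back unchanged; that is presumably what happens with the URL above. The sketch below shows roughly what such a tag looks like and how its target could be pulled out with the standard library alone. The meta_refresh_target helper and the sample HTML are assumptions for illustration, and a regex is only a rough stand-in for a real HTML parser.

```ruby
# Hypothetical helper: return the target URL of a <meta http-equiv="refresh">
# tag, or nil if the page has none. Regex-based, so only a rough approximation.
def meta_refresh_target(html)
  tag = html[/<meta[^>]*http-equiv=["']?refresh["']?[^>]*>/i]
  return nil unless tag

  tag[/content=["']\s*\d+\s*;\s*url=["']?([^"'>\s]+)/i, 1]
end

# Made-up page body of the kind an intermediate redirect page might return.
sample_html = <<~HTML
  <html><head>
  <meta http-equiv="refresh" content="0; url=http://company.zynga.com/privacy/policy">
  </head><body>Redirecting...</body></html>
HTML

target = meta_refresh_target(sample_html)
puts target
```

Mechanize does this parsing for you once follow_meta_refresh is enabled, which is why the answer above needs no extra code.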

Detect redirect with ruby mechanize

I am using the mechanize/nokogiri gems to parse some random pages. I am having problems with 301/302 redirects. Here is a snippet of the code:
agent = Mechanize.new
page = agent.get('http://example.com/page1')
The test server on example.com redirects page1 to page2 with a 301/302 status code, so I was expecting to have
page.code == "301"
Instead I always get page.code == "200".
My requirements are:
I want redirects to be followed (default mechanize behavior, which is good)
I want to be able to detect that page was actually redirected
I know that I can see the page1 in agent.history, but that's not reliable. I want the redirect status code also.
How can I achieve this behavior with mechanize?
You could leave redirect off and just keep following the location header:
agent.redirect_ok = false
page = agent.get 'http://www.google.com'
status_code = page.code
while page.code[/30[12]/]
  page = agent.get page.header['location']
end
I found a way to allow redirects and also get the status code, but I'm not sure it's the best method.
agent = Mechanize.new
# deactivate redirects first
agent.redirect_ok = false
status_code = '200'
error_occurred = false
# request url
begin
  page = agent.get(url)
  status_code = page.code
rescue Mechanize::ResponseCodeError => ex
  status_code = ex.response_code
  error_occurred = true
end
if !error_occurred && status_code != '200' then
  # enable redirects and request the page again
  agent.redirect_ok = true
  page = agent.get(url)
end
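The idea of following the Location header by hand can also be sketched with plain Net::HTTP, recording the status code of every hop along the way. This is only a sketch under stated assumptions: redirect_chain and its fetch parameter are hypothetical names, and the canned responses below stand in for a real server so the example runs offline.

```ruby
require 'net/http'
require 'uri'

REDIRECT_CODES = %w[301 302 303 307 308].freeze

# Hypothetical helper: follow Location headers manually and return an array of
# [url, status_code] pairs, one per request. `fetch` is any callable that takes
# a URI and returns an object responding to #code and #[] (header lookup).
def redirect_chain(url, limit: 10, fetch: ->(uri) { Net::HTTP.get_response(uri) })
  uri = URI(url)
  hops = []
  limit.times do
    response = fetch.call(uri)
    hops << [uri.to_s, response.code]
    location = response['location']
    return hops unless REDIRECT_CODES.include?(response.code) && location

    # Location may be relative, so resolve it against the current URI.
    uri = uri.merge(location)
  end
  raise "gave up after #{limit} requests"
end

# Offline stand-in for real HTTP responses.
FakeResponse = Struct.new(:code, :location) do
  def [](name)
    name.downcase == 'location' ? location : nil
  end
end

canned = {
  'http://example.com/page1' => FakeResponse.new('301', '/page2'),
  'http://example.com/page2' => FakeResponse.new('200', nil)
}

hops = redirect_chain('http://example.com/page1', fetch: ->(uri) { canned.fetch(uri.to_s) })
hops.each { |hop_url, code| puts "#{code} #{hop_url}" }
```

Leaving out the fetch argument makes the helper issue real requests through Net::HTTP.get_response. The first pair in the result tells you whether the original URL was redirected and with which status code.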

login vk.com net::http.post_form

I want to log in to vk.com or m.vk.com with Ruby, but my code doesn't work.
require 'net/http'
email = "qweqweqwe#gmail.com"
pass = "qeqqweqwe"
userUri = URI('m.vk.com/index.html')
Net::HTTP.get(userUri)
res = Net::HTTP.post_form(userUri, 'email' => email, 'pass' => pass)
puts res.body
First of all, you need to change userUri to the following:
userUri = URI('https://login.vk.com/?act=login')
That is where the vk site expects your login parameters.
I'm not very familiar with vk, but you probably need a way to handle the session cookie, both receiving it and sending it with future requests. Can you elaborate on what you're doing after login?
Here is the net/http info for cookie handling:
# Headers
res['Set-Cookie'] # => String
res.get_fields('set-cookie') # => Array
res.to_hash['set-cookie'] # => Array
puts "Headers: #{res.to_hash.inspect}"
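To send the cookies back, the name=value part of each Set-Cookie field can be joined into a single Cookie request header. The following is a rough sketch: cookie_header is a hypothetical helper, the cookie names and the /feed path are made up, and the response is built by hand so the example runs without a network. It ignores attributes such as Domain, Path and expiry, which a real cookie jar would honor.

```ruby
require 'net/http'

# Hypothetical helper: turn the Set-Cookie fields of a response into the value
# of a Cookie request header, keeping only each cookie's name=value part.
def cookie_header(response)
  fields = response.get_fields('set-cookie') || []
  fields.map { |field| field.split(';').first.strip }.join('; ')
end

# Stand-in response built by hand (cookie names and values are made up).
res = Net::HTTPOK.new('1.1', '200', 'OK')
res.add_field('Set-Cookie', 'session=abc123; Path=/; HttpOnly')
res.add_field('Set-Cookie', 'lang=en; Path=/')

# Attach the cookies to a follow-up request (the path is a placeholder).
req = Net::HTTP::Get.new('/feed')
req['Cookie'] = cookie_header(res)
puts req['Cookie']
```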
This kind of task is exactly what Mechanize is for. Mechanize handles redirects and cookies automatically. You can do something like this:
require 'mechanize'
agent = Mechanize.new
url = "http://m.vk.com/login/"
page = agent.get(url)
form = page.forms[0]
form['email'] = "qweqweqwe#gmail.com"
form['pass'] = "qeqqweqwe"
form.submit
puts agent.page.body

Ruby Mechanize: Follow a Link

In Mechanize on Ruby, I have to assign a new variable to every new page I come to. For example:
page2 = page1.link_with(:text => "Continue").click
page3 = page2.link_with(:text => "About").click
...etc
Is there a way to run Mechanize without a variable holding every page state? like
my_only_page.link_with(:text => "Continue").click!
my_only_page.link_with(:text => "About").click!
I don't know if I understand your question correctly, but if it's a matter of looping through a lot of pages dynamically and processing them, you could do it like this:
require 'mechanize'
url = "http://example.com"
agent = Mechanize.new
page = agent.get(url) #Get the starting page
loop do
  # What you want to do on the page, e.g. extract something...
  item = page.parser.css('.some_item').text
  puts item # placeholder: store or process the extracted text here
  if (link = page.link_with(:text => "Continue")) # As long as there is still a next-page link...
    page = link.click
  else # If no link is left, break out of the loop
    break
  end
end
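The same loop can be wrapped in a small helper so the calling code never names the intermediate pages. This is a sketch, not part of Mechanize: each_linked_page is a hypothetical helper that relies only on the page responding to link_with and the link responding to click, as in the code above. The FakePage and FakeLink structs are stand-ins so the example runs without the gem or a network.

```ruby
# Hypothetical helper: yield the start page and every page reached by
# repeatedly clicking the link whose text is link_text. Returns the pages
# visited. max_pages guards against link cycles.
def each_linked_page(start_page, link_text, max_pages: 100)
  visited = []
  page = start_page
  max_pages.times do
    visited << page
    yield page if block_given?
    link = page.link_with(:text => link_text)
    break unless link

    page = link.click
  end
  visited
end

# Offline stand-ins that mimic the two methods used above.
FakeLink = Struct.new(:target) do
  def click
    target
  end
end

FakePage = Struct.new(:title, :next_page) do
  def link_with(criteria)
    next_page && criteria[:text] == 'Continue' ? FakeLink.new(next_page) : nil
  end
end

third  = FakePage.new('three', nil)
second = FakePage.new('two', third)
first  = FakePage.new('one', second)

titles = []
each_linked_page(first, 'Continue') { |page| titles << page.title }
puts titles
```

With a real agent, the first argument would be the page returned by agent.get(url).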
