In Mechanize (Ruby), how to login then scrape? [duplicate] - ruby

My aim: in Ruby on Rails 3, download a PDF file from a site that requires you to log in before you can download it.
My method, using Mechanize:
Step 1: log in
Step 2: once logged in, get the PDF link
The problem: when I debug and click on the scraped link, I am redirected to the login page instead of getting the file.
Here are the two checks I ran after step 1:
(...)
search_results = form.submit
puts search_results.body
=> {"succes":true,"URL":"/sso/inscription/"}
Apparently the login succeeded.
puts agent.cookie_jar.jar
=> I could find the information about my session, so I guess the cookies are saved.
Any hint about what I did wrong?
(This could be important: on the site, when you log in at "http://elwatan.com/sso/inscription/inscription_payant.php", you are redirected to the home page, elwatan.com.)
My code is below:
# step 1, login:
agent = Mechanize.new
page = agent.get("http://elwatan.com/sso/inscription/inscription_payant.php")
form = page.form_with(:id => 'form-login-page')
form.login = "my_mail"
form.password = "my_pasword"
search_results = form.submit
# step 2, get the PDF:
@watan = {}
page.parser.xpath('//th/a').each do |link|
  puts @watan[link.text.strip] = link['href']
end

The agent object keeps the session and the cookies.
So you log in first, as you did, and then call agent.get(---your-pdf-link-here--).
Your example code contains a small error: the result of the submit is stored in search_results, but you then keep using page (the login page) to search for the links.
So in your case, I guess it should look like this (untested, of course):
# step 1, login:
agent = Mechanize.new
agent.pluggable_parser.pdf = Mechanize::FileSaver
page = agent.get("http://elwatan.com/sso/inscription/inscription_payant.php")
form = page.form_with(:id => 'form-login-page')
form.login = "my_mail"
form.password = "my_pasword"
page = form.submit
# step 2, get the PDF:
page.parser.xpath('//th/a').each do |link|
  agent.get link['href']
end

The page variable is not updated automatically after a submit, a link click, etc.
You need to either work with the page returned by submit:
agent = Mechanize.new
page = agent.get("http://elwatan.com/sso/inscription/inscription_payant.php")
form = page.form_with(:id => 'form-login-page')
form.login = "my_mail"
form.password = "my_pasword"
page = form.submit
Or manually get a new page:
agent = Mechanize.new
page = agent.get("http://elwatan.com/sso/inscription/inscription_payant.php")
form = page.form_with(:id => 'form-login-page')
form.login = "my_mail"
form.password = "my_pasword"
form.submit
page2 = agent.get('http://...')
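
As a minimal sketch of the "manually get a new page" variant: log in, ignore the page returned by submit, then fetch the protected page with the same agent so the session cookies are sent along. The helper name links_after_login and its keyword arguments are hypothetical, and the agent is only assumed to behave like a Mechanize agent (get, form_with, []=, submit, links, href). It has not been tried against the real site.

```ruby
# Hypothetical helper; `agent` is assumed to behave like a Mechanize agent.
# Logs in through the form with the given id, then returns the hrefs found
# on `listing_url`, fetched with the SAME agent (and therefore the same cookies).
def links_after_login(agent, login_url:, form_id:, fields:, listing_url:)
  login_page = agent.get(login_url)
  form = login_page.form_with(id: form_id)
  raise ArgumentError, "no form with id #{form_id.inspect}" if form.nil?

  # Field names are converted to strings before being set on the form.
  fields.each { |name, value| form[name.to_s] = value }
  form.submit # the returned page is ignored on purpose

  # A fresh request after the login; links without an href are skipped.
  agent.get(listing_url).links.map(&:href).compact
end

# Possible usage (the listing URL is a placeholder, not taken from the question):
#   agent = Mechanize.new
#   hrefs = links_after_login(agent,
#                             login_url: "http://elwatan.com/sso/inscription/inscription_payant.php",
#                             form_id: "form-login-page",
#                             fields: { login: "my_mail", password: "my_pasword" },
#                             listing_url: "http://...")
#   hrefs.each { |href| agent.get(href) }
```

The point of the design is that every request goes through one agent object, so nothing about cookies has to be handled by hand.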

Related

Rails ruby-mechanize how to get a page after redirection

I want to collect manufacturers and their medicine details from http://www.mims.com/India/Browse/Alphabet/All?cat=Company&tab=company.
I am using the Mechanize gem to extract content from the HTML pages, following Ryan's tutorial.
I can log in successfully, but I cannot reach the destination page http://www.mims.com/India/Browse/Alphabet/All?cat=Company&tab=company.
Here is what I have tried so far:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'mechanize'
agent = Mechanize.new
agent.user_agent = 'Individueller User-Agent'
agent.user_agent_alias = 'Linux Mozilla'
agent.get("https://sso.mims.com/Account/SignIn") do |page|
  #login_page = a.click(page.link_with(:text => /Login/))
  # Submit the login form
  login_page = page.form_with(:action => '/') do |f|
    f.SignInEmailAddress = 'username of mims'
    f.SignInPassword = 'secret'
  end.click_button
  url = 'http://www.mims.com/India/Browse/Alphabet/A?cat=drug'
  page = agent.get url # here checking authentication if success then redirecting to destination
  p page
end
Note: I have shared dummy login credentials for your testing.
After clicking the 'CompaniesBrowse Company Directory' link, the site shows an intermediate page with the flash message "you are redirecting...", and Mechanize only gets this intermediate page.
Question:
1) How can I get the final page (after the redirection)?
I found the cause: the MIMS site auto-submits a form from a page onload callback to check authentication. Mechanize does not execute JavaScript, so this step never happens with the Mechanize gem.
Solution
Manually submitting the form twice solves the issue. Example:
url = 'http://www.mims.com/India/Browse/Alphabet/A?cat=drug'
page = agent.get url # here checking authentication if success then redirecting to destination
p page
page.form.submit
agent.page.form.submit

Detect redirect with ruby mechanize

I am using the mechanize/nokogiri gems to parse some random pages. I am having problems with 301/302 redirects. Here is a snippet of the code:
agent = Mechanize.new
page = agent.get('http://example.com/page1')
The test server on mydomain.com redirects page1 to page2 with a 301/302 status code, so I was expecting:
page.code == "301"
Instead I always get page.code == "200".
My requirements are:
1) I want redirects to be followed (the default Mechanize behavior, which is good).
2) I want to be able to detect that the page was actually redirected.
I know that I can see page1 in agent.history, but that is not reliable. I also want the redirect status code.
How can I achieve this behavior with mechanize?
You could leave redirect off and just keep following the location header:
agent.redirect_ok = false
page = agent.get 'http://www.google.com'
status_code = page.code
while page.code[/30[12]/]
  page = agent.get page.header['location']
end
I found a way to allow redirects and also get the status code, but I'm not sure it's the best method.
agent = Mechanize.new
# deactivate redirects first
agent.redirect_ok = false
status_code = '200'
error_occurred = false
# request url
begin
  page = agent.get(url)
  status_code = page.code
rescue Mechanize::ResponseCodeError => ex
  status_code = ex.response_code
  error_occurred = true
end
if !error_occurred && status_code != '200' then
  # enable redirects and request the page again
  agent.redirect_ok = true
  page = agent.get(url)
end
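
Both answers come down to following the Location header by hand and remembering the status codes on the way. Here is a minimal sketch of that bookkeeping, kept independent of Mechanize so it can run without network access. The names trace_redirects and Hop are hypothetical, and the block is assumed to return the status code and the Location header for a URL.

```ruby
require 'uri'

# One visited URL together with the status code it answered with.
Hop = Struct.new(:url, :code)

# Follows 301/302 responses by hand and returns every hop that was visited.
# The block receives a URL and must return [status_code, location_header].
def trace_redirects(start_url, limit: 10)
  hops = []
  url = start_url
  limit.times do
    code, location = yield(url)
    hops << Hop.new(url, code.to_s)
    return hops unless %w[301 302].include?(code.to_s) && location
    url = URI.join(url, location).to_s # the Location header may be relative
  end
  raise "gave up after #{limit} redirects"
end

# Demo with canned responses instead of a real server.
canned = {
  'http://example.com/page1' => ['301', '/page2'],
  'http://example.com/page2' => ['200', nil]
}
hops = trace_redirects('http://example.com/page1') { |url| canned.fetch(url) }
hops.each { |hop| puts "#{hop.code} #{hop.url}" }
# prints:
#   301 http://example.com/page1
#   200 http://example.com/page2
```

With Mechanize, the block could be `{ |url| page = agent.get(url); [page.code, page.header['location']] }` after setting agent.redirect_ok = false, as in the first answer. A page was redirected when hops has more than one entry, and hops.first.code is the original status code.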

Using Ruby and Mechanize to fill in a remote login form mystery

I am trying to implement a Ruby script that takes a username and password, fills in the account details on a login form on another website, then follows a link and retrieves the account history. To do this I am using the Mechanize gem.
I have been following the examples here
but I still can't get it to work. I have simplified the script considerably to get it working in parts, but a supposedly simple form fill-in is holding me up.
Here is my code:
# script gets called with a username and password for the site
require 'mechanize'
# create a Mechanize instance
agent = Mechanize.new
agent.get('https://mysite/Login.aspx') do |login_page|
  # fill in the login form on the login page
  loggedin_page = login_page.form_with(:id => 'form1') do |form|
    username_field = form.field_with(:id => 'ContentPlaceHolder1_UserName')
    username_field.value = ARGV[0]
    password_field = form.field_with(:id => 'ContentPlaceHolder1_Password')
    password_field.value = ARGV[1]
    button = form.button_with(:id => 'ContentPlaceHolder1_btnlogin')
  end.submit(form , button)
  # click the View my history link
  #account_history_page = loggedin_page.click(home_page.link_with(:text => "View My History"))
  ####TEST to see if i am actually making it past the login page
  #### and that the View My History link is now visible amongst the other links on the page
  loggedin_page.links.each do |link|
    text = link.text.strip
    next unless text.length > 0
    puts text if text == "View My History"
  end
  ##TEST
end
Terminal error message:
stackqv2.rb:19:in `block in <main>': undefined local variable or method `form' for main:Object (NameError)
from /usr/local/lib/ruby/gems/1.9.1/gems/mechanize-2.5.1/lib/mechanize.rb:409:in `get'
from stackqv2.rb:8:in `<main>'
You don't need to pass form as an argument to submit. The button is also optional. Try using the following:
loggedin_page = login_page.form_with(:id => 'form1') do |form|
  username_field = form.field_with(:id => 'ContentPlaceHolder1_UserName')
  username_field.value = ARGV[0]
  password_field = form.field_with(:id => 'ContentPlaceHolder1_Password')
  password_field.value = ARGV[1]
end.submit
If you really do need to specify which button is used to submit the form, try this:
form = login_page.form_with(:id => 'form1')
username_field = form.field_with(:id => 'ContentPlaceHolder1_UserName')
username_field.value = ARGV[0]
password_field = form.field_with(:id => 'ContentPlaceHolder1_Password')
password_field.value = ARGV[1]
button = form.button_with(:id => 'ContentPlaceHolder1_btnlogin')
loggedin_page = form.submit(button)
It's a matter of scope:
page.form do |form|
  # this block has its own scope
  form['foo'] = 'bar' # <- ok, form is defined inside this block
end
puts form # <- error, form is not defined here
ramblex's suggestion is to not use a block with your form and I agree, it's less confusing that way.
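
The scope rule can be tried without Mechanize at all. The sketch below uses TinyPage, a hypothetical stand-in that copies only one behaviour the answers rely on: form_with yields the form to the block and then returns that same form.

```ruby
# TinyPage is a made-up class for illustration, not part of Mechanize.
class TinyPage
  Form = Struct.new(:fields)

  # Yields a new form to the block (if one is given) and returns the form.
  def form_with
    form = Form.new({})
    yield form if block_given?
    form
  end
end

page = TinyPage.new

# Block style: `f` exists only inside the block, so keep the return value.
filled = page.form_with { |f| f.fields['user'] = 'alice' }
puts filled.fields['user']                 # alice
puts binding.local_variable_defined?(:f)   # false

# Plain style: `form` is an ordinary local variable, usable on later lines.
form = page.form_with
form.fields['user'] = 'bob'
puts form.fields['user']                   # bob
```

In the block style the only way to reach the form afterwards is through the value returned by form_with, which is why `end.submit` works while `end.submit(form, button)` raises a NameError.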

Mechanize Page.Form.Action POST for multiple INPUT tags with same NAME / VALUE

I need to post to an existing web page (no login required) and submit parameters, but the page contains several forms whose submit inputs have identical NAME and VALUE attributes; for example, this submit INPUT is repeated 3 times on the same page under different FORM tags:
<INPUT TYPE='Submit' NAME='submit_button' VALUE='Submit Query'>
My Ruby code correctly identifies the fields on the forms, but posting to the page.forms[x].action fails with 405 HTTPMethodNotAllowed for https://pdb.nipr.com/html/PacNpnSearch -- unhandled response.
Ruby code:
class PostNIPR2
  def post(url)
    button_count = 0
    agent = Mechanize.new
    page = agent.get(url)
    page.forms.each do |form|
      form.buttons.each do |button|
        if(button.value == 'Submit Query')
          button_count = button_count + 1
          if (button_count == 3)
            btn_submit_license = button.name
            puts button
            puts btn_submit_license
            puts button.value
          end
        end
      end
    end
    begin
      uform = page.forms[1]
      uform.license = "0H20649"
      uform.state = "CA"
      uform.action = 'https://pdb.nipr.com/html/PacNpnSearch'
    rescue Exception => e
      error_page = e.page
    end
    page = agent.submit(uform)
  end
end
url = "https://pdb.nipr.com/html/PacNpnSearch.html"
p = PostNIPR2.new
p.post(url)
Is your question how to select that button? If so:
form.button_with(:name => 'submit_button')
or submit the form like this:
next_page = form.submit form.button_with(:name => 'submit_button')
Also, you are changing the form's action for some reason, which would explain the 405s.
You are correct, sorry about the commented code. The goal was to update form.license and form.state with the input params and then have form.submit post the form using form.button_with(:name => 'Submit Query'). I did this and received the 405 HTTPMethodNotAllowed for https://pdb.nipr.com/html/PacNpnSearch -- unhandled response. I have now changed the code to use agent.page.form_with(:name => 'license_form'), which correctly finds the form I need to post to; I then get the button with form.button_with(:value => 'Submit Query') and call agent.submit(form, button). Now I get the correct result.
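
The working approach from this follow-up could be sketched as follows: pick the form by its name, pick the button by its value inside that form, and leave the form's action untouched. The helper name submit_named_form and its keyword arguments are hypothetical; agent and page are only assumed to behave like Mechanize objects (form_with, button_with, []=, submit).

```ruby
# Hypothetical helper; `agent` and `page` are assumed to behave like Mechanize objects.
# Finds the form by name and the button by value, sets the fields, and submits.
def submit_named_form(agent, page, form_name:, button_value:, fields: {})
  form = page.form_with(name: form_name)
  raise ArgumentError, "no form named #{form_name.inspect}" if form.nil?

  # The button is looked up inside the chosen form, so identical buttons
  # in the page's other forms do not matter.
  button = form.button_with(value: button_value)
  raise ArgumentError, "no button with value #{button_value.inspect}" if button.nil?

  fields.each { |name, value| form[name.to_s] = value }
  agent.submit(form, button) # the form's own action is left unchanged
end

# Possible usage with the values from the question:
#   agent = Mechanize.new
#   page = agent.get("https://pdb.nipr.com/html/PacNpnSearch.html")
#   result = submit_named_form(agent, page,
#                              form_name: "license_form",
#                              button_value: "Submit Query",
#                              fields: { license: "0H20649", state: "CA" })
```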

Ruby Mechanize: Follow a Link

In Mechanize on Ruby, I have to assign a new variable to every new page I come to. For example:
page2 = page1.link_with(:text => "Continue").click
page3 = page2.link_with(:text => "About").click
...etc
Is there a way to run Mechanize without a variable holding every page state? like
my_only_page.link_with(:text => "Continue").click!
my_only_page.link_with(:text => "About").click!
I don't know if I understand your question correctly, but if it's a matter of looping through a lot of pages dynamically and process them, you could do it like this:
require 'mechanize'
url = "http://example.com"
agent = Mechanize.new
page = agent.get(url) # Get the starting page
loop do
  # What you want to do on the page, e.g. extract something...
  item = page.parser.css('.some_item').text
  puts item # placeholder: store or process the extracted text here
  if link = page.link_with(:text => "Continue") # As long as there is still a next-page link...
    page = link.click
  else # If no link is left, break out of the loop
    break
  end
end
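
The same loop can be wrapped in a small helper so that the calling code never holds more than one page variable. This is only a sketch: the name each_page and the max_pages guard are assumptions added for illustration, and the page objects are only assumed to respond to link_with and to return links that respond to click, as Mechanize pages do.

```ruby
# Hypothetical helper: yields the starting page, then keeps following the link
# with the given text until it disappears or max_pages pages have been visited.
# Returns the number of pages that were yielded.
def each_page(page, link_text:, max_pages: 50)
  visited = 0
  while page && visited < max_pages
    yield page
    visited += 1
    link = page.link_with(text: link_text)
    page = link && link.click # nil when there is no further link
  end
  visited
end

# Possible usage:
#   agent = Mechanize.new
#   each_page(agent.get("http://example.com"), link_text: "Continue") do |page|
#     puts page.parser.css('.some_item').text
#   end
```

The max_pages limit is there so that a site whose "Continue" link points back to an earlier page cannot keep the loop running forever.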

Resources