Mechanize select ids with suffix - ruby

I'm trying to scrape some URLs from a page with Mechanize. I use link_with(:id => '...'). Every id has the same prefix but a different numeric suffix. My code:
require 'mechanize'
m = Mechanize.new
results = m.get(website_url)
listing_link = results.link_with(:id => "listing-1234-56")
click_link = listing_link.click
How can I click on each link whose id looks like "listing-XXXX-XX"? Thanks.

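You could pass a regular expression instead of a string:
results.link_with(:id => /^listing-/)
Note that link_with returns only the first match. links_with takes the same criteria and returns every matching link, so you can iterate over the result and click each one.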

Related

How do I resolve an HTTP 500 error while web scraping with Mechanize in Ruby?

I want to retrieve my driving licence number, issue_date, and expiry_date from this website (https://sarathi.nic.in:8443/nrportal/sarathi/HomePage.jsp). When I try to fetch them, I get the error Mechanize::ResponseCodeError: 500 => Net::HTTPInternalServerError for https://sarathi.nic.in:8443/nrportal/sarathi/DlDetRequest.jsp -- unhandled response.
This is the code that I wrote to scrape:
require 'mechanize'
require 'logger'
require 'nokogiri'
require 'open-uri'
require 'openssl'
OpenSSL::SSL::VERIFY_PEER = OpenSSL::SSL::VERIFY_NONE
agent = Mechanize.new
agent.log = Logger.new "mech.log"
agent.user_agent_alias = 'Mac Safari 4'
Mechanize.new.get("https://sarathi.nic.in:8443/nrportal/sarathi/HomePage.jsp")
page=agent.get('https://sarathi.nic.in:8443/nrportal/sarathi/HomePage.jsp') # opening home page.
page = agent.page.links.find { |l| l.text == 'Status of Licence' }.click # click the link.
page.forms_with(:name=>"dlform").first.field_with(:name=>"dlform:DLNumber").value="TN38 20120001119" #user input to text field.
page.form_with(:name=>"dlform").field_with(:name=>"javax.faces.ViewState").value="SUBMIT" #submit button value assigning.
page.form(:name=>"dlform",:action=>"/nrportal/sarathi/DlDetRequest.jsp") #to specify the form i need.
agent.cookie_jar.clear!
gg=agent.submit page.forms.last #submitting my form
It isn't working because you clear the cookies just before submitting the form, which discards the session cookies the server set when the page was loaded. I got it working simply by removing that call:
...
page.forms_with(:name=>"dlform").first.field_with(:name=>"dlform:DLNumber").value="TN38 20120001119" #user input to text field
form = page.form(:name=>"dlform",:action=>"/nrportal/sarathi/DlDetRequest.jsp")
gg = agent.submit form, form.buttons.first
Note that you do not need to set a value for the submit button. Instead, pass the button to agent.submit when you submit the form.

Ruby mechanize Form

Is there any way to copy the output for a form to a file? For example:
require 'mechanize'
agent = Mechanize.new
page = agent.get("http://www.google.com")
search = page.form_with(:action => "/search")
I want to store the output that irb shows for "search" in a file.
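One possible approach, sketched under the assumption that the text wanted is the pretty-printed inspect output that irb displays. The helper name dump_inspect, the FakeForm stand-in, and the output file name are all hypothetical. With Mechanize you would pass the real search form instead of the stand-in:

```ruby
require 'pp'
require 'tmpdir'

# Hypothetical helper: writes the pretty-printed representation of any
# object to a file and returns the text that was written.
def dump_inspect(object, path)
  text = object.pretty_inspect
  File.write(path, text)
  text
end

# Stand-in for the Mechanize form so the sketch runs without network access.
FakeForm = Struct.new(:action, :fields)
search = FakeForm.new('/search', { 'q' => '' })

path = File.join(Dir.tmpdir, 'search_form.txt')
dump_inspect(search, path)
puts File.read(path)
```

Because the helper only relies on pretty_inspect, it should work the same way for a page, a link list, or any other object you would otherwise look at in irb.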

Why is Mechanize not following the link

I am trying to follow a link with Mechanize, but it does not seem to work. The syntax appears to be correct. Am I referencing the link incorrectly, or do I need to do something else?
Problem area
agent.page.links_with(:text => 'VG278H')[2].click
Full Code
require 'rubygems'
require 'mechanize'
require 'open-uri'
agent = Mechanize.new
agent.get ("http://icecat.biz/en/")
#Show all form fields belonging to the first form
form = agent.page.forms[0].fields
#Enter VG278H into the text box lookup_text, submit the data
agent.page.forms[0]["lookup_text"] = "VG278H"
agent.page.forms[0].submit #Results of this is stored in Mechanize agent.page object
#Call agent.page with our results and assign them to a variable page
page = agent.page
agent.page.links_with(:text => 'VG278H')[2].click
doc = page.parser
puts doc
You should grab a copy of Charles (http://www.charlesproxy.com/) or a similar tool that lets you watch what happens when you submit the form from your browser. Anyway, your problem is that this part:
agent.page.forms[0]["lookup_text"] = "VG278H"
agent.page.forms[0].submit
is returning an html fragment that looks like this:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"><script>self.location.href="http://icecat.us/index.cgi?language=en&new_search=1&lookup_text=VG278H"</script>
So you need to either request that URL directly, or scrape the self.location.href value out of the response and have your agent perform a get:
page = agent.get("http://icecat.us/index.cgi?language=en&new_search=1&lookup_text=VG278H")
If you were going to do that, this works:
require 'mechanize'
agent = Mechanize.new
agent.get("http://icecat.biz/en/") # load the home page first
page = agent.get("http://icecat.us/index.cgi?language=en&new_search=1&lookup_text=VG278H")
page = page.links_with(:text => 'VG278H')[2].click
doc = page.parser
puts doc
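If you would rather not hard-code the URL, the self.location.href value could be pulled out of the response body with a regular expression. This is only a sketch: js_redirect_url is a hypothetical helper name, and the pattern assumes the script assigns a quoted string literal exactly as in the fragment quoted above.

```ruby
# The fragment returned by the form submission, as quoted above.
body = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' \
       '<script>self.location.href="http://icecat.us/index.cgi?language=en&new_search=1&lookup_text=VG278H"</script>'

# Hypothetical helper: returns the URL assigned to self.location.href,
# or nil when the HTML contains no such assignment.
def js_redirect_url(html)
  html[/self\.location\.href\s*=\s*["']([^"']+)["']/, 1]
end

redirect_url = js_redirect_url(body)
puts redirect_url

# With a live agent, the body would come from the submitted form's result:
#   result = agent.page.forms[0].submit
#   url = js_redirect_url(result.body)
#   page = agent.get(url) if url
```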
Happy scraping

Ruby Mechanize: Follow a Link

In Mechanize on Ruby, I have to assign a new variable to every new page I come to. For example:
page2 = page1.link_with(:text => "Continue").click
page3 = page2.link_with(:text => "About").click
...etc
Is there a way to run Mechanize without a variable holding every page state? Something like:
my_only_page.link_with(:text => "Continue").click!
my_only_page.link_with(:text => "About").click!
I'm not sure I understand your question correctly, but if it's a matter of looping through a lot of pages dynamically and processing them, you could reuse a single variable like this:
require 'mechanize'
url = "http://example.com"
agent = Mechanize.new
page = agent.get(url) # Get the starting page
loop do
  # What you want to do on the page, e.g. extract something...
  item = page.parser.css('.some_item').text
  puts item # or store it however you like
  # As long as there is still a "Continue" link, follow it...
  if (link = page.link_with(:text => "Continue"))
    page = link.click
  else # ...otherwise break out of the loop
    break
  end
end

Ruby - Mechanize: Select link by classname and other questions

At the moment I'm having a look at Mechanize.
I am pretty new to Ruby, so please be patient.
I wrote a little test script:
require 'rubygems'
require 'mechanize'
agent = WWW::Mechanize.new
page = agent.get('http://www.google.de')
pp page.title
google_form = page.form_with(:name => 'f')
google_form.q = 'test'
page = agent.submit(google_form)
pp page.title
page_links = Array.new
page.links.each do |ll|
  page_links << ll
end
puts page_links.size
This works, but page_links includes not only the search results. It also includes Google's own links such as Login, Pictures, ...
The result links have the CSS class "l". Is it possible to select only the links with class == "l"? How do I achieve this?
Is it possible to modify the agent alias? If I own a website that uses Google Analytics or something similar, which browser client will I see in GA when I visit my site with Mechanize?
Can I select elements by their ID instead of their name? I tried to use
my_form = page.form_with(:id => 'myformid')
But this does not work.
In cases like yours I use Nokogiri's DOM search.
Here is your code, slightly rewritten:
require 'rubygems'
require 'mechanize'
agent = Mechanize.new
page = agent.get('http://www.google.de')
pp page.title
google_form = page.form_with(:name => 'f')
google_form.q = 'test'
page = agent.submit(google_form)
pp page.title
page_links = Array.new
# maybe you'd better use 'h3.r > a.l' here
# page.parser here is a Nokogiri::HTML::Document
page.parser.css("a.l").each do |ll|
  page_links << ll
  puts ll.text + "=>" + ll["href"]
end
puts page_links.size
This article is probably a good place to start:
getting-started-with-nokogiri
By the way, the samples in the article also deal with Google search ;)
You can build a list of just the search result links by changing your code as follows:
page.links.each do |ll|
  cls = ll.attributes.attributes['class']
  page_links << ll if cls && cls.value == 'l'
end
For each element ll in page.links, ll.attributes is a Nokogiri::XML::Element and ll.attributes.attributes is a Hash containing the attributes on the link. Hence the need for ll.attributes.attributes to get at the actual class, and for the nil check before comparing the value to 'l'.
The problem with using :id in the criteria to find a form is that it clashes with Ruby's Object#id method, which returns a Ruby object's internal id. I'm not sure what the workaround for this is. You would have no problem selecting the form by some other attribute (e.g. its action).
I believe the selector you are looking for is :dom_id. In your case:
my_form = page.form_with(:dom_id => 'myformid')

Resources